@webscidl phd student project reviews august 5&6, 2015
TRANSCRIPT
![Page 1: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/1.jpg)
Web Science and Digital Libraries Research Group
@WebSciDL
Review of Projects for Herbert Van de Sompel, LANL
August 5&6, 2015
![Page 2: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/2.jpg)
Corren McCoy
Disambiguation of Alumni from Publicly Available Social Media Profiles
Presentation for Herbert Van de Sompel 08/05/2015
![Page 3: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/3.jpg)
Let’s be Social!
Directory Search
Name: Michael Nelson
College: Old Dominion
Degree: Computer Science
Year: 1997
![Page 4: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/4.jpg)
Motivation
Maintain relationships with alumni
Interact and re-engage
Pew Research Survey, Sept. 2014
LinkedIn is used by 28% of online adults. 23% are between 18-29*
Twitter is used by 23% of online adults. 37% are between 18-29
*Pew Research Center noted a significant change in this percentage from 2013
![Page 5: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/5.jpg)
Research Goals
• Given discrete set of attributes
• Leverage public information
• Collect structured/unstructured metadata
• Develop a probabilistic matching scheme
• Analyze and discover new profile attributes
• Connect the networks
![Page 6: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/6.jpg)
Seminal Works
• Mislove, A., Viswanath, B., Gummadi, K. P., & Druschel, P. (2010, February). You are who you know: inferring user profiles in online social networks. In Proceedings of the third ACM international conference on Web search and data mining (pp. 251-260). ACM.
• Northern, C. T., & Nelson, M. L. (2011). An unsupervised approach to discovering and disambiguating social media profiles. In Proceedings of Mining Data Semantics Workshop.
• Powell, J., Shankar, H., Rodriguez, M., & Van de Sompel, H. (2014). EgoSystem: Where are our Alumni?. Code4Lib Journal, (24).
![Page 7: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/7.jpg)
Our Work is Informed
Mislove: Attribute inference based on a Facebook crawl of a known friends network with matching to a Student or Alumni Directory.
Northern: Examination of digital preservation strategies across social media sites using feature data to score and disambiguate the discovered profiles.
Powell: Aggregation of discovered social and institutional artifacts to a public identity which are linked in a property graph to facilitate searching.
![Page 8: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/8.jpg)
Similarity Metrics
![Page 9: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/9.jpg)
Does it help to know a name?
Name Ranking as of 2014
First names (Social Security Administration): Michael 7, Michele -----
Surnames (Census): Nelson 40, Weigle 13,604
First names 19,584
Surnames 150,436
![Page 10: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/10.jpg)
Are Vanity Screen Names Re-used?
LinkedIn: michaellloydnelson
Twitter: phonedude_mln
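A question like this can be turned into a number with a generic string similarity score. The sketch below is only illustrative: the function name `screen_name_similarity`, the normalization step, and the use of `difflib.SequenceMatcher` are assumptions, not the metric used in this project.

```python
import re
from difflib import SequenceMatcher


def screen_name_similarity(a: str, b: str) -> float:
    """Return a similarity score between 0 and 1 for two screen names."""
    # Lowercase and drop everything except letters and digits (an assumption).
    norm_a = re.sub(r"[^a-z0-9]", "", a.lower())
    norm_b = re.sub(r"[^a-z0-9]", "", b.lower())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


# The two screen names from the slide share little text.
print(screen_name_similarity("michaellloydnelson", "phonedude_mln"))
# A re-used screen name scores 1.0.
print(screen_name_similarity("mikenelson64", "MikeNelson64"))
```

A low score here does not rule out a match; it only means this one feature gives no support for it.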
![Page 11: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/11.jpg)
Is the Affiliation Repeated?
LinkedIn: Old Dominion University
Twitter: Old Dominion University mentioned in bio but could be a false positive
![Page 12: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/12.jpg)
How Far Apart in Space?
LinkedIn: Norfolk, Virginia Area
Twitter: Norfolk, VA
![Page 13: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/13.jpg)
Do People Re-use Profile Photos?
TinEye Reverse Image Search
![Page 14: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/14.jpg)
Do Web Links Point to the Same Page?
LinkedIn: http://www.cs.odu.edu/~mln/ http://ws-dl.blogspot.com/ http://f-measure.blogspot.com/
Twitter: cs.odu.edu/~mln/
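The links above differ only in scheme and a leading `www.`, so a comparison has to normalize them first. The following is a minimal sketch under assumed rules (add a scheme when missing, lowercase the host, drop `www.` and a trailing slash); `normalize_link` is a hypothetical helper, not part of the project's code.

```python
from urllib.parse import urlsplit


def normalize_link(link: str) -> str:
    """Reduce a profile link to a comparable host + path key."""
    # Twitter often shows links without a scheme, so add one before parsing.
    if "://" not in link:
        link = "http://" + link
    parts = urlsplit(link)
    host = (parts.hostname or "").lower()
    # Treat "www.example.org" and "example.org" as the same host (an assumption).
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


linkedin_links = [
    "http://www.cs.odu.edu/~mln/",
    "http://ws-dl.blogspot.com/",
    "http://f-measure.blogspot.com/",
]
twitter_links = ["cs.odu.edu/~mln/"]

shared = {normalize_link(u) for u in linkedin_links} & {
    normalize_link(u) for u in twitter_links
}
print(shared)  # {'cs.odu.edu/~mln'}
```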
![Page 15: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/15.jpg)
Community Analysis
Surrogate Connections - People Also Viewed
One step from Dr. Nelson
One step from Brittany Johnson
![Page 16: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/16.jpg)
Community Analysis
Disclosed – (Followers?) and Following
![Page 17: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/17.jpg)
Property Graph Analysis
https://twitter.com/phonedude_mln
https://www.linkedin.com/in/michaellloydnelson
![Page 18: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/18.jpg)
Property Graph Analysis
https://twitter.com/phonedude_mln
https://www.linkedin.com/in/michaellloydnelson
Location
Norfolk, Virginia area Norfolk, VA
![Page 19: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/19.jpg)
Property Graph Analysis
https://twitter.com/phonedude_mln
https://www.linkedin.com/in/michaellloydnelson
Location
Norfolk, Virginia area Norfolk, VA
Affiliation Value: Old Dominion
Attended
![Page 20: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/20.jpg)
Property Graph Analysis
https://twitter.com/phonedude_mln
https://www.linkedin.com/in/michaellloydnelson
Geo-Location
Norfolk, Virginia area Norfolk, VA
Affiliation Value: Old Dominion
Attended
Twitter @ODUNow
hasOfficialAccount
![Page 21: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/21.jpg)
Property Graph Analysis
https://twitter.com/phonedude_mln
https://www.linkedin.com/in/michaellloydnelson
Geo-Location
Norfolk, Virginia area Norfolk, VA
Affiliation Value: Old Dominion
Attended
Twitter @ODUNow
hasOfficialAccount
follows
![Page 22: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/22.jpg)
Example Searches
![Page 23: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/23.jpg)
LinkedIn Candidate Search
• Leverage Google’s advanced search operators to improve precision.
• Trusted information from the Registrar’s Office.
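One way to picture the candidate search is a query string that combines Google's `site:` operator with quoted phrases built from Registrar fields. This is an illustrative sketch only: the helper name `linkedin_candidate_query` and the choice of fields are assumptions, not the queries used in the study.

```python
def linkedin_candidate_query(name: str, college: str) -> str:
    """Build a Google query restricted to LinkedIn (hypothetical helper)."""
    # site: limits results to one host; quoted phrases require exact matches.
    return f'site:linkedin.com "{name}" "{college}"'


# Fields taken from the directory search example earlier in the deck.
print(linkedin_candidate_query("Michael Nelson", "Old Dominion"))
# site:linkedin.com "Michael Nelson" "Old Dominion"
```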
![Page 24: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/24.jpg)
LinkedIn Metadata
How Prevalent are Nicknames?
| Name | Michael Nelson | Mike Nelson | Mike Nelson |
| --- | --- | --- | --- |
| Headline | Professor at Old Dominion University | Orthotist / Certified Athletic Trainer | Driver at Old Dominion Freight Line |
| Location | Norfolk, Virginia Area | Providence, Rhode Island Area | Phoenix, Arizona |
| URL | https://www.linkedin.com/in/michaellloydnelson | https://www.linkedin.com/in/mikenelson64 | https://www.linkedin.com/pub/mike-nelson/6b/50b/879 |
| Profile Photo | https://media.licdn.com/mpr/mpr/shrinknp_400_400/p/1/000/019/1d1/39275de.jpg | https://media.licdn.com/mpr/mpr/shrinknp_400_400/p/2/000/02f/11d/3f17849.jpg | ----- |
| Vanity Screen Name | michaellloydnelson | mikenelson64 | |
| Industry | Research | Hospital & Health Care | Transportation/Trucking/Railroad |
| Websites | http://www.cs.odu.edu/~mln/; http://ws-dl.blogspot.com/; http://f-measure.blogspot.com/ | ----- | ---- |
| Affiliation(s) | Old Dominion University, 1997-2000; Old Dominion University, 1996-1997; Virginia Polytechnic Institute and State University, 1987-1991 | Old Dominion University, 1999-2001 | ----- |
![Page 25: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/25.jpg)
Twitter Candidate Search
![Page 26: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/26.jpg)
Twitter Metadata
Given and Nickname Search
| User Name | Michael L. Nelson | Mike Nelson | Mike Nelson |
| --- | --- | --- | --- |
| Bio | Head of @WebSciDL, Computer Science, Old Dominion University; Formerly: @NASA_Langley (1991-2002), @UNCSILS (2000-2001); OAI-PMH OAI-ORE Memento ResourceSync | ----- | ----- |
| Location | Norfolk, VA | ----- | ----- |
| URL | https://twitter.com/phonedude_mln | https://twitter.com/mikenelson64 | ----- |
| Profile Photo | https://pbs.twimg.com/profile_images/959295176/mln-ad-100x130_400x400.jpg | ----- | ----- |
| Screen Name | Phonedude_mln | mikenelson64 | |
| Industry | ----- | ----- | ----- |
| Websites | cs.odu.edu/~mln/ | ----- | |
| Affiliation(s) | Old Dominion University in bio. Following @ODUNow official account | ----- | ----- |
![Page 27: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/27.jpg)
Known Issues
• Reliability of Name Searches
  – Nicknames list from the Northern (2011) study is incomplete. Ignores ethnic given names.
  – Given name and surname data from the US Census and SSA only include names that occur above a certain threshold, to protect privacy.
  – Naïve calculation of name probabilities. Some name combinations do not occur frequently.
• Uncovering social data is difficult
  – LinkedIn limits use of API to get real connections.
  – Rate limits on the Twitter API constrain the depth of the followers/following search.
![Page 28: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/28.jpg)
Known Issues
• Each network takes a different approach to the visibility of metadata
  – Exploit the structure of LinkedIn
  – Twitter data is noisy, limited space with no controlled vocabulary
![Page 29: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/29.jpg)
By: Alexander Nwala August 5, 2015
Progress Report
Presented To: Dr. Herbert Van de Sompel, Dr. Michael Nelson
![Page 30: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/30.jpg)
Progress Report
Outline
• Past projects
  • Refactoring Hany’s Carbon date
  • What Did It Look Like?
  • I Can Haz Memento
• Present research
  • Exploration of Distributed Information Retrieval
    • Problem
    • Goal
    • Research paths; possible contributions
![Page 31: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/31.jpg)
Carbon date
• Estimates the creation date of a URI
• The current implementation features:
  • A threaded server
  • Concurrent API requests
  • Cached responses
• The estimate is the earliest date found among these sources:
  • Last-Modified date
  • Bitly
  • Topsy
  • Backlinks
  • Archives
Website: http://cd.cs.odu.edu
Blog post: http://ws-dl.blogspot.com/2014/11/2014-11-14-carbon-dating-web-version-20.html
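The selection rule on this slide (take the earliest date any source reports) can be sketched in a few lines. The function name `estimate_creation_date`, the source keys, and all dates below are hypothetical; this is not the Carbon date code itself.

```python
from datetime import datetime
from typing import Dict, Optional, Tuple


def estimate_creation_date(
    candidates: Dict[str, Optional[datetime]]
) -> Optional[Tuple[str, datetime]]:
    """Return the (source, date) pair with the earliest date, or None."""
    # Sources that found nothing report None and are skipped.
    found = {src: d for src, d in candidates.items() if d is not None}
    if not found:
        return None
    earliest = min(found, key=lambda src: found[src])
    return earliest, found[earliest]


# Hypothetical dates for one URI; real values would come from each source.
candidates = {
    "last_modified": datetime(2012, 3, 1),
    "bitly": None,
    "topsy": datetime(2011, 7, 15),
    "backlinks": datetime(2010, 9, 30),
    "archives": datetime(2008, 1, 5),
}
print(estimate_creation_date(candidates))
```

With these made-up inputs the archives entry wins because it carries the earliest date.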
![Page 32: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/32.jpg)
What Did It Look Like?
• Tumblr blog which
  • Uses the Memento framework to poll various public web archives
  • Creates an animated image for each year that shows the progression of the site through the years
• Everyone is free to nominate web sites to What Did It Look Like? by tweeting: “#whatdiditlooklike URL”
Website: http://whatdiditlooklike.mementoweb.org/
Blog post: http://ws-dl.blogspot.com/2015/01/2015-02-05-what-did-it-look-like.html
![Page 33: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/33.jpg)
I Can Haz Memento
• Inspired by the “#icanhazpdf” movement and also built upon the Memento framework
• For tweets that contain a link and “#icanhazmemento”, the I Can Haz Memento service replies to the tweet with a link pointing to an archived version of the page closest to the time of the tweet
Website: https://twitter.com/icanhazmemento/
Blog post: http://ws-dl.blogspot.com/2015/07/2015-07-22-i-can-haz-memento.html
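Picking the archived version closest to the time of the tweet amounts to minimizing the absolute time difference. The sketch below assumes the list of mementos has already been fetched; `closest_memento`, the URIs, and the datetimes are hypothetical placeholders, not the service's implementation.

```python
from datetime import datetime
from typing import List, Tuple


def closest_memento(
    mementos: List[Tuple[datetime, str]], tweet_time: datetime
) -> Tuple[datetime, str]:
    """Return the (datetime, URI) pair nearest in time to the tweet."""
    return min(mementos, key=lambda m: abs(m[0] - tweet_time))


# Placeholder mementos for one page.
mementos = [
    (datetime(2014, 11, 2, 8, 0), "http://archive.example/20141102080000/http://example.com/"),
    (datetime(2015, 7, 20, 12, 0), "http://archive.example/20150720120000/http://example.com/"),
    (datetime(2015, 8, 30, 9, 30), "http://archive.example/20150830093000/http://example.com/"),
]
tweet_time = datetime(2015, 7, 22, 15, 0)
print(closest_memento(mementos, tweet_time)[1])
```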
![Page 34: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/34.jpg)
Progress Report
Outline
• Past projects
  • Refactoring Hany’s Carbon date
  • What Did It Look Like?
  • I Can Haz Memento
• Present research
  • Exploration of Distributed Information Retrieval
    • Problem
    • Goal
    • Research paths; possible contributions
![Page 35: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/35.jpg)
Problem :: Undiscoverable resources are not included in SERPs
• SERP does not have intended resource: “A kinetic theory for age-structured stochastic birth-death processes”
• But resource is available in a special collection (arXiv.org)
Case 1, SERP for Query: “stochastic birth-death processes”
Google Search
arXiv.org Search
![Page 36: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/36.jpg)
Problem :: Information not discoverable through Google does not exist for many web users
• 1st page of SERP does not have intended resource: “EPIDEMIOLOGY THROUGH CELLULAR…”
Case 2, SERP for Query: “influenza indonesia”
![Page 37: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/37.jpg)
Problem :: Inconsistent views between SERP and special collections
Case 2, SERP for Query: “influenza indonesia”
Google Search: Relevant resource on 7th page
arXiv.org Search: Relevant resource on 1st page
![Page 38: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/38.jpg)
Problem :: When to stop?
• A user potentially misses relevant information because it is NOT presented with the search results OR is presented too far down (e.g. on the 7th page)
• In other words, if relevant content is not presented in the first n pages (e.g. n < 3), it does not exist
![Page 39: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/39.jpg)
Goal :: Present resources from multiple unindexed sources with Google SERP
• This can be achieved through middleware such as a browser plugin
1. 10 more relevant resources
2. Click
Relevant resource on 1st page
![Page 40: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/40.jpg)
Exploration of DIR :: Problem summary and Goal
• Problem
  • Inconsistent views between SERP and special collections lead to absence of relevant resources in SERPs (Case 1)
  • If relevant content is not presented in the first n pages (e.g. n < 3), it does not exist (Case 2)
• Goal
  • Present resources from multiple unindexed sources with Google SERP
![Page 41: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/41.jpg)
Exploration of DIR :: Possible research paths
• Research Pathway 1: Understanding the search results
• Research Pathway 2: Understanding the query
• Research Pathway 3: Understanding the data source
![Page 42: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/42.jpg)
Research Pathway 1 vs Research Pathway 2
Research Pathway 2: Understanding the query
• Blindly routing every query to every data source is unacceptable
• Query understanding
  • Domain classification of query
  • Intent recognition of query
  • Semantic labelling of query
• Route to each data source only the queries relevant to it: e.g. a news-related query to a news source, academic queries to academic sources
• State of the art targets building statistical machine learning methods to solve the query understanding problem
• Include results from data source with SERP
Research Pathway 1: Understanding search results
• Blindly routing every query to every data source is unacceptable
• Understand the search results for clues to unravel nature of query
  • Are Advertisements present
  • Are Images present
  • Are PDF types present
• Route to each data source only the queries relevant to it: e.g. a news-related query to a news source, academic queries to academic sources
• State of the art doesn’t focus on search results
• Include results from data source with SERP
![Page 43: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/43.jpg)
Research Pathway 1: Find discriminative features for “non-scholarly materials domain”
Query length
Permutation of Pages
Result count
Title match
Images present
HTML resource
News present
Google knowledge entity present
![Page 44: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/44.jpg)
Research Pathway 1: Find discriminative features for “scholarly materials domain”
Query length
Permutation of Pages
Result count
Title subset match
PDF resources
Notable Absences
• Google Knowledge Entity
• News
• Ads
Notable Presence
• Non HTML resources (PDF)
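Several of the features named on these two slides can be read straight off a result list. The sketch below shows one possible encoding; the record fields (`title`, `url`, `kind`), the function name `serp_features`, the example URLs, and the second example title are all hypothetical, and `result_count` here simply counts the records passed in rather than the total reported by a search engine.

```python
from typing import Dict, List


def serp_features(query: str, results: List[Dict[str, str]]) -> Dict[str, float]:
    """Turn one query and its result list into a flat feature dictionary."""
    n = len(results) or 1  # avoid dividing by zero on an empty result list
    title_matches = sum(query.lower() in r.get("title", "").lower() for r in results)
    pdfs = sum(r.get("url", "").lower().endswith(".pdf") for r in results)
    return {
        "query_length": len(query.split()),
        "result_count": len(results),
        "title_match_ratio": title_matches / n,
        "pdf_ratio": pdfs / n,
        "news_present": float(any(r.get("kind") == "news" for r in results)),
        "images_present": float(any(r.get("kind") == "image" for r in results)),
    }


results = [
    {"title": "A kinetic theory for age-structured stochastic birth-death processes",
     "url": "http://example.org/paper.pdf", "kind": "web"},
    {"title": "Birth and death: a short history",
     "url": "http://example.org/page.html", "kind": "web"},
]
features = serp_features("stochastic birth-death processes", results)
print(features)
```

A dictionary like this is the kind of input a classifier could later be trained on.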
![Page 45: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/45.jpg)
Research Pathway 1: What next after finding discriminative features?
• Find a dataset (Done)
  • NASA NTRS query log for scholarly materials domain (400,000+)
  • AOL 2006 query logs for non-scholarly materials domain (400,000+)
• Train a classifier (Not done)
  • Given a query and a list of search results, classify the query as belonging to one of multiple classes, e.g. (Scholarly material)
![Page 46: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/46.jpg)
Research Pathway 2: Heuristic for unsupervised domain classification
Original algorithm 1:
• Idea: Given a query and a list of search results, the important terms which co-occur across multiple search results are indicative of the domain of the query.
Query
1: Search Engine, URIs List
doc1 <a, a, a, b, b.., c> doc2 <a, c, x, y, d, d> docn <a, p, w, s>
2: Generate unigram vectors, remove redundant terms
<a, b, c> <a, c, x, y, d> <a, p, w, s>
3: Sort
<a, a, a, b, c, c, d, p, s, w, x, y>
4: Find clusters
<a, a, a> <b> <c, c> <d> <p> <s> <w> <x> <y>
Domain Set: P
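Steps 2 to 4 can be sketched on the toy vectors from the slide. This is a simplified illustration: it clusters only exactly equal terms, and the function name `domain_terms` and the `min_docs` cut-off are assumptions rather than part of the original algorithm.

```python
from collections import Counter
from typing import Dict, List


def domain_terms(docs: List[List[str]], min_docs: int = 2) -> Dict[str, int]:
    """Map each term to the number of documents containing it."""
    # Step 2: one unigram set per document; the set drops repeated terms.
    per_doc = [set(tokens) for tokens in docs]
    # Step 3: merge the per-document terms and sort so equal terms are adjacent.
    merged = sorted(term for terms in per_doc for term in terms)
    # Step 4: each run of equal terms is a cluster; its size is the number of
    # documents that contain the term.
    counts = Counter(merged)
    return {term: n for term, n in counts.items() if n >= min_docs}


docs = [
    ["a", "a", "a", "b", "b", "c"],  # doc1
    ["a", "c", "x", "y", "d", "d"],  # doc2
    ["a", "p", "w", "s"],            # docn
]
print(domain_terms(docs))  # {'a': 3, 'c': 2}
```

Terms that survive the cut-off are the candidates for the domain set P.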
![Page 47: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/47.jpg)
Original algorithm 1 Example: Possible domains for query “Lionel messi”
• (terms), 10 of 11 pages
• (barcelona"., barcellona-granada, barcelon,, barcelon, barcelona), 9 of 11 pages
• (best"., best), 9 of 11 pages
• (championship, champion, championship,, champions..., champions:, championships., champions', championships, championship:, champions.", championship-winning, champions, champions".), 9 of 11 pages
• (city, city)), 9 of 11 pages
• (club, club's, club's...), 9 of 11 pages
• (consented, considerably, consecutively)., consecutively,, considered, consent, consistent, conscious, consecutively"., consecutive, considers, consider), 9 of 11 pages
• (everybody, every), 9 of 11 pages
• (fc, fc.), 9 of 11 pages
• (football, football".), 9 of 11 pages
• (game"., game".[370], game), 9 of 11 pages
Relevant domains based on human judgement
![Page 48: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/48.jpg)
Original algorithm 2: Heuristic for supervised domain classification
• Given a set of predefined domains D and the domain set P produced by step 4 (Find clusters) of algorithm 1:
  <a, a, a> <b> <c, c> <d> <p> <s> <w> <x> <y>
• Select the domain with max( similarity(Pi, Di) )
• Similarity
  • Naive hybrid similarity (Jaccard/Overlap coefficient)
  • WordNet
  • Explicit Semantic Analysis
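The Jaccard and Overlap coefficients named above are standard set measures, and choosing the domain with the maximum similarity is a one-line `max`. The slide does not say how the two are combined into the "hybrid", so the plain average used below is an assumption, as are the function names and the example domains.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the smaller set."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


def hybrid(a: Set[str], b: Set[str]) -> float:
    # Assumption: average the two coefficients.
    return (jaccard(a, b) + overlap(a, b)) / 2


def best_domain(p: Set[str], domains: Dict[str, Set[str]]) -> str:
    """Return the predefined domain most similar to the domain set P."""
    return max(domains, key=lambda name: hybrid(p, domains[name]))


p = {"barcelona", "football", "club", "fc", "champions"}
domains = {  # hypothetical predefined domains
    "sports": {"football", "club", "game", "champions", "league"},
    "politics": {"election", "senate", "vote"},
}
print(best_domain(p, domains))  # sports
```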
![Page 49: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/49.jpg)
Exploration of DIR :: Summary
• Problem
  • There exists an inconsistency between SERP and special collections, thus many relevant resources are either not included in SERPs or included too late (e.g. last page)
• Goal
  • Present resources from multiple unindexed sources with Google SERP, which can be done through a browser plugin
• Research Pathways
  • Understand the search result and train a model to learn when a query should be forwarded to a special collection
  • Understand the query, for example the domain, then forward only relevant queries to their respective special collections
  • Include results from special collection with SERP
![Page 50: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/50.jpg)
TEMPORAL COHERENCE OF COMPOSITE MEMENTOS IN WEB ARCHIVES
SCOTT G. AINSWORTH OLD DOMINION UNIVERSITY
AUGUST 5, 2015 OLD DOMINION UNIVERSITY
![Page 51: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/51.jpg)
CONTENTS
■ Motivation (Appearances can be deceiving)
■ Background
■ Temporal Coherence
■ Research
■ What’s next?
8/5/2015 Scott G. Ainsworth • Status for Herbert Van de Sompel Visit
![Page 52: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/52.jpg)
MOTIVATION
TEMPORAL COHERENCE OF COMPOSITE MEMENTOS IN WEB ARCHIVES
![Page 53: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/53.jpg)
APPEARANCES …
http://web.archive.org/web/20041209190926/http://www.wunderground.org/cgi-bin/findWeather/getForecast?query=50593 (now 404, but that's a different story…)
![Page 54: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/54.jpg)
… CAN BE DECEIVING
Root Memento-Datetime: 2004-12-09T19:09:26
![Page 55: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/55.jpg)
CLEAR OR CLOUDY?
![Page 56: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/56.jpg)
QUESTIONS
■ How prevalent is temporal incoherence?
■ Can Temporal Coherence be improved using
  ■ Multiple archives?
  ■ Additional memento selection heuristics?
■ How can Temporal Coherence be conveyed?
![Page 57: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/57.jpg)
BACKGROUND
COMPOSITE MEMENTOS
COHERENCE STATES
COHERENCE PATTERNS
TEMPORAL COHERENCE OF COMPOSITE MEMENTOS IN WEB ARCHIVES
![Page 58: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/58.jpg)
COMPOSITE MEMENTO
PRESENTATION STRUCTURE
URI-M0
URI-M1 URI-M2 URI-Mi-1...
URI-Mi URI-Mi+1 URI-Mn...
![Page 59: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/59.jpg)
COHERENCE STATES
■ Prima Facie Coherent: Evidence that the memento existed in its archived state when the root was acquired.
■ Prima Facie Violative: Evidence … did not exist ...
■ Possibly Coherent: Evidence … might have existed ...
■ Probably Violative: Evidence … probably did not exist ...
![Page 60: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/60.jpg)
CONSIDER THIS HTML…
<html> <img src="foo.jpeg"> </html>
![Page 61: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/61.jpg)
AND THESE RESPONSE HEADERS
HTTP/1.1 200 OK
Server: Tengine/2.0.3
Date: Mon, 27 Apr 2015 22:03:32 GMT
Content-Type: image/jpeg
Content-Length: 15632
Connection: keep-alive
Memento-Datetime: Tue, 07 Feb 2006 00:58:23 GMT
Link: <Memento links deleted...>
X-Archive-Orig-server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4
X-Archive-Orig-etag: "4978-3d10-3e4d822e"
X-Archive-Orig-content-length: 15632
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-date: Tue, 07 Feb 2006 00:58:20 GMT
X-Archive-Orig-content-type: image/jpeg
X-Archive-Orig-last-modified: Fri, 14 Feb 2003 23:56:30 GMT
X-Archive-Orig-connection: close
<other headers deleted>
![Page 62: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/62.jpg)
PRIMA FACIE COHERENT
Bracket Pattern: Memento-Datetime + Last-Modified
(yes, Last-Modified is sometimes wrong, but many of those cases can be detected)
![Page 63: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/63.jpg)
PRIMA FACIE COHERENT
Equal Pattern: simultaneous capture (with an optionally tunable “bubble of simultaneity”)
![Page 64: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/64.jpg)
PRIMA FACIE VIOLATIVE
![Page 65: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/65.jpg)
POSSIBLY COHERENT
Closest (or only) memento captured before the root
![Page 66: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/66.jpg)
PROBABLY VIOLATIVE
Closest (or only) memento captured after the root but no Last-Modified (possibly indicating a dynamically generated representation)
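The patterns on the preceding slides can be read as a small decision procedure over three datetimes: the root's Memento-Datetime, the embedded memento's Memento-Datetime, and its Last-Modified if present. The sketch below is one reading of those slides, not the author's code. The Prima Facie Violative slide carries no text, so the rule used for it (captured after the root and last modified after the root) is an assumption, as are the function name and the root datetime in the example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def coherence_state(
    root: datetime,
    captured: datetime,
    last_modified: Optional[datetime] = None,
    bubble: timedelta = timedelta(0),
) -> str:
    """Classify an embedded memento relative to the root memento."""
    # Equal pattern: captured together with the root, within the tunable bubble.
    if abs(captured - root) <= bubble:
        return "prima facie coherent"
    # Captured before the root: it might still have been current at root time.
    if captured < root:
        return "possibly coherent"
    # From here on the memento was captured after the root.
    if last_modified is None:
        return "probably violative"
    # Bracket pattern: Last-Modified <= root <= Memento-Datetime.
    if last_modified <= root:
        return "prima facie coherent"
    # Assumption: modified after the root, so this state did not exist yet.
    return "prima facie violative"


# Header values from the response headers slide; the root datetime is hypothetical.
root = datetime(2005, 1, 1, tzinfo=timezone.utc)
captured = parsedate_to_datetime("Tue, 07 Feb 2006 00:58:23 GMT")
last_modified = parsedate_to_datetime("Fri, 14 Feb 2003 23:56:30 GMT")
print(coherence_state(root, captured, last_modified))  # prima facie coherent
```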
![Page 67: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/67.jpg)
TEMPORAL COHERENCE
EMBEDDED RESOURCES
REPRESENTING COHERENCE
TEMPORAL COHERENCE OF COMPOSITE MEMENTOS IN WEB ARCHIVES
![Page 68: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/68.jpg)
TEMPORAL COHERENCE
![Page 69: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/69.jpg)
TEMPORAL COHERENCE
Root captured 2005-05-14 01:36:08; embedded resources captured +9 days, +18 days, +7 months, and +2.1 years later
![Page 70: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/70.jpg)
EMBEDDED RESOURCES
| Resource | Memento-Datetime | Delta |
| --- | --- | --- |
| http://www.cs.odu.edu | 2005-05-14 01:36:08 | |
| mm_menu.js | 2005-05-23 02:39:12 | 9.0 d |
| style.css | 2005-05-23 02:39:39 | 9.0 d |
| gfx-logo-odu-crown.gif | 2005-05-23 02:39:39 | 9.0 d |
| ddmenu_ddown.js | 2005-05-23 02:39:43 | 9.0 d |
| university.js | 2005-05-23 02:39:56 | 9.0 d |
| rmenu_1st_about.png | 2005-06-01 13:40:25 | 18.5 d |
| rmenu_bottom_229.gif | 2005-06-01 14:07:29 | 18.5 d |
| shadow-bl.gif | 2005-06-01 14:55:53 | 18.6 d |
| ecsbdg.jpg | 2005-06-01 14:56:17 | 18.6 d |
| shadow-br.gif | 2005-06-01 15:18:18 | 18.6 d |
| gfx-btn-go-dblue.gif | 2005-06-01 15:34:19 | 18.6 d |
| shadow-tr.gif | 2005-06-01 15:55:57 | 18.6 d |
| header-right1.gif | 2005-06-01 16:06:16 | 18.6 d |
| spacer.gif | 2005-06-01 16:23:10 | 18.6 d |
| jimcheng.gif | 2005-06-01 16:37:39 | 18.6 d |
| jsmith.gif | 2005-06-01 16:58:50 | 18.6 d |
| rmenu_1st_featured_alumni.png | 2005-06-01 21:21:45 | 18.8 d |
| hmenu_college_...-new.png | 2005-12-21 20:14:25 | 7.3 mo |
| rmenu_1st_upcoming_news.png | 2005-12-21 20:15:14 | 7.3 mo |
| rmenu_1st_upcoming_events.png | 2005-12-21 21:01:12 | 7.3 mo |
| lmenu_1st_resources.png | 2005-12-28 17:47:41 | 7.5 mo |
| bullet_blue_triangle.gif | 2005-12-28 19:43:48 | 7.5 mo |
| logo-cs.gif | 2005-12-28 19:54:29 | 7.5 mo |
| rmenu_1st_featured_student.png | 2007-06-12 02:36:07 | 2.1 years |
| shadow-b.gif | 2007-06-21 02:35:17 | 2.1 years |
| shadow-r.gif | 404 Not Found | |
Embedded Resources 26
Mean Delta 125.9 days
Standard Deviation 207.7 days
Minimum Delta 9.0 days
Maximum Delta 2.1 years
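The Delta column is the gap between each embedded memento's Memento-Datetime and the root's. A minimal sketch of that arithmetic, using three rows copied from the slide; the function and variable names are illustrative and not taken from the author's tooling:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Root Memento-Datetime of http://www.cs.odu.edu, as listed on the slide.
ROOT = datetime.strptime("2005-05-14 01:36:08", FMT)

# A few embedded resources and their Memento-Datetimes from the slide.
EMBEDDED = {
    "mm_menu.js": "2005-05-23 02:39:12",
    "spacer.gif": "2005-06-01 16:23:10",
    "lmenu_1st_resources.png": "2005-12-28 17:47:41",
}


def delta_days(memento_datetime, root=ROOT):
    """Return the signed distance from the root memento, in days."""
    captured = datetime.strptime(memento_datetime, FMT)
    return (captured - root).total_seconds() / 86400.0


for name, stamp in EMBEDDED.items():
    print(f"{name}: {delta_days(stamp):.1f} days")
```

A positive value means the embedded resource was captured after the root page; a negative value would mean it was captured before.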
![Page 71: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/71.jpg)
REPRESENTING COHERENCE
![Page 72: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/72.jpg)
REPRESENTING COHERENCE
![Page 73: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/73.jpg)
REPRESENTING COHERENCE
![Page 74: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/74.jpg)
REPRESENTING COHERENCE
![Page 75: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/75.jpg)
REPRESENTING COHERENCE
![Page 76: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/76.jpg)
THE FULL CHART
[Chart: Mementos by Delta. One axis is the root Memento-Datetime (2001 to 2013); the other is the delta from the root URI-M (rURI-M), from -3y to +6y. Legend: Prima Facie Coherent, Probably Coherent, Probably Violative, Prima Facie Violative. Highlighted root date: 2005-03-10.]
![Page 77: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/77.jpg)
RESEARCH DATA SET SAMPLING STATISTICS
TEMPORAL COHERENCE OF COMPOSITE MEMENTOS IN WEB ARCHIVES
![Page 78: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/78.jpg)
DATA SET
■ 4,000 sample URI-Rs (JCDL’11 data set)
■ Single and Multiple Archives
■ Two Heuristics:
  ■ Minimum distance (current default Wayback behavior): choose closest Memento-Datetime
  ■ Bracket (proposed here): use combination of Memento-Datetime + Last-Modified (when available)
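The two heuristics can be sketched as selection functions over candidate mementos. The minimum-distance rule follows the slide directly. The bracket rule below is only one plausible reading, assuming that a memento's Last-Modified and Memento-Datetime bound an interval in which that representation is known to have existed; the slide itself says no more than "combination of Memento-Datetime + Last-Modified". The dictionary field names are hypothetical.

```python
from datetime import datetime


def min_distance(target, mementos):
    """Pick the memento whose Memento-Datetime is closest to the target."""
    return min(mementos, key=lambda m: abs(m["memento_datetime"] - target))


def bracket(target, mementos):
    """Assumed bracket reading: prefer mementos whose
    [Last-Modified, Memento-Datetime] interval contains the target,
    and fall back to minimum distance when none does."""
    bracketing = [
        m for m in mementos
        if m.get("last_modified") is not None
        and m["last_modified"] <= target <= m["memento_datetime"]
    ]
    if bracketing:
        return min_distance(target, bracketing)
    return min_distance(target, mementos)


root = datetime(2005, 5, 14, 1, 36, 8)
candidates = [
    {"id": "A", "memento_datetime": datetime(2005, 5, 10), "last_modified": None},
    {"id": "B", "memento_datetime": datetime(2005, 6, 1),
     "last_modified": datetime(2005, 1, 1)},
]
print(min_distance(root, candidates)["id"], bracket(root, candidates)["id"])
```

In this toy input the two rules disagree: A is nearer in time, while B carries Last-Modified evidence that spans the root datetime.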
![Page 79: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/79.jpg)
SAMPLING & RECOMPOSITION
■ For each sample URI-R (rURI-R):
  ■ Download available TimeMaps
  ■ Download a single root Memento per month
  ■ For each monthly Memento
    ■ Extract embedded URI-Rs (eURI-Rs)
    ■ Download TimeMaps for eURI-Rs
    ■ Download heuristically-best eURI-Ms
    ■ Repeat recursively
■ Run each heuristic and single-/multi-archive combination
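The "single root Memento per month" step can be sketched as a reduction over the Memento-Datetimes listed in a TimeMap. The slide does not say which memento within a month is chosen; keeping the earliest one is an assumption made here for illustration.

```python
from datetime import datetime


def one_per_month(memento_datetimes):
    """Keep one Memento-Datetime per (year, month).

    Assumption: the earliest capture in each month is kept.
    """
    chosen = {}
    for dt in sorted(memento_datetimes):
        chosen.setdefault((dt.year, dt.month), dt)
    return list(chosen.values())


timemap = [
    datetime(2005, 5, 23, 2, 39, 12),
    datetime(2005, 5, 14, 1, 36, 8),
    datetime(2005, 6, 1, 16, 23, 10),
]
for dt in one_per_month(timemap):
    print(dt.isoformat())
```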
![Page 80: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/80.jpg)
ROOT URI-R STATISTICS
Archival Data

| Measure | Count | Share |
|---|---|---|
| Root URI-Rs archived | 2,756 | 68.9% |
| In multiple archives | 1,180 | 29.5% |
| Mean archives per URI-R | 1.58 | |
| Mean mementos per URI-R | 124.57 | |

URI-M Status

| Status | Count | Share |
|---|---|---|
| 200 OK | 82,425 | 93.6% |
| 503 Service Unavailable | 4,444 | 5.0% |
| 404 Not Found | 583 | 0.7% |
| 403 Forbidden | 388 | 0.4% |
| Others | 214 | 0.3% |
![Page 81: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/81.jpg)
EMBEDDED URI-R STATISTICS
Archival Data

| Measure | Count | Share |
|---|---|---|
| Embedded URI-Rs | 1,623,127 | |
| Embedded URI-Rs per root URI-M | 19.7 | |
| Embedded URI-Ms available | 1,332,993 | 93.6% |
| Embedded URI-Ms available per root URI-M | 15.1 | |

URI-M Failure Reasons

| Reason | Count | Share |
|---|---|---|
| Not archived | 312,641 | 83.9% |
| 404 Not Found | 44,852 | 12.0% |
| 403 Forbidden | 6,116 | 1.6% |
| 503 Service Unavailable | 5,442 | 1.5% |
| Others | 3,508 | 0.9% |
![Page 82: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/82.jpg)
COMPOSITE MEMENTO (ROOT) COMPLETENESS & COHERENCE
Completeness (and Missing)

| Description | MinDist Single | MinDist Multi | Bracket Single | Bracket Multi |
|---|---|---|---|---|
| Mean Complete | 76.1% | 80.2% | 76.2% | 80.3% |
| Mean Missing | 23.9% | 19.8% | 23.8% | 19.7% |

Coherence

| Description | MinDist Single | MinDist Multi | Bracket Single | Bracket Multi |
|---|---|---|---|---|
| Mean Prima Facie Coherent | 41.0% | 40.9% | 54.7% | 54.6% |
| Mean Possibly Coherent | 27.3% | 28.7% | 12.8% | 14.2% |
| Mean Probably Violative | 2.5% | 5.3% | 2.5% | 5.3% |
| Mean Prima Facie Violative | 5.3% | 5.3% | 6.2% | 6.2% |

At least 5% of pages can be shown to have temporal violations!
Multiple archives: +completeness, -coherence?
![Page 83: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/83.jpg)
EMBEDDED MEMENTO COHERENCE
Counts

| Description | MinDist Single | MinDist Multi | Bracket Single | Bracket Multi |
|---|---|---|---|---|
| Prima Facie Coherent | 622,565 | 621,447 | 864,736 | 859,625 |
| Possibly Coherent | 497,405 | 466,046 | 244,104 | 215,585 |
| Probably Violative | 104,376 | 53,734 | 104,339 | 53,694 |
| Prima Facie Violative | 100,760 | 103,662 | 114,062 | 117,469 |
| Totals | 1,325,106 | 1,244,889 | 1,327,241 | 1,246,373 |

Percentages

| Description | MinDist Single | MinDist Multi | Bracket Single | Bracket Multi |
|---|---|---|---|---|
| Prima Facie Coherent | 47.0% | 49.9% | 65.2% | 69.0% |
| Possibly Coherent | 37.5% | 37.4% | 18.4% | 17.3% |
| Probably Violative | 7.9% | 4.3% | 7.9% | 4.3% |
| Prima Facie Violative | 7.6% | 8.3% | 8.6% | 9.4% |

At least 7% of embedded resources are used violatively!
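The percentage figures on this slide follow from the counts by dividing each coherence state by its column total. A short check of that arithmetic, with the counts copied from the slide and variable names chosen only for this example:

```python
STATES = [
    "Prima Facie Coherent",
    "Possibly Coherent",
    "Probably Violative",
    "Prima Facie Violative",
]

# Embedded memento counts per heuristic/archive combination, from the slide.
COUNTS = {
    "MinDist Single": [622_565, 497_405, 104_376, 100_760],
    "MinDist Multi": [621_447, 466_046, 53_734, 103_662],
    "Bracket Single": [864_736, 244_104, 104_339, 114_062],
    "Bracket Multi": [859_625, 215_585, 53_694, 117_469],
}


def shares(counts):
    """Convert raw counts to percentages of the column total (one decimal)."""
    total = sum(counts)
    return [round(100.0 * c / total, 1) for c in counts]


for combo, counts in COUNTS.items():
    row = ", ".join(f"{s}: {p}%" for s, p in zip(STATES, shares(counts)))
    print(f"{combo} (n={sum(counts):,}): {row}")
```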
![Page 84: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/84.jpg)
WHAT’S NEXT? EQUALITY & SIMILARITY MINOR & MAJOR VIOLATIONS POLICIES & HEURISTICS CONVEYING COHERENCE
TEMPORAL COHERENCE OF COMPOSITE MEMENTOS IN WEB ARCHIVES
![Page 85: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/85.jpg)
EQUALITY & SIMILARITY
Equality and similarity allow prima facie coherence without Last-Modified
Early results: equality yields < 2% improvement
![Page 86: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/86.jpg)
MINOR OR MAJOR VIOLATIONS?
■ This is a temporal violation. But is it meaningful?
■ How to judge?
  ■ Most archives transform HTML
  ■ Few support export of original file
■ How to measure similarity on binary files?
![Page 87: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/87.jpg)
POLICY & HEURISTIC TRADEOFFS
■ Speed: minimize distance
■ Completeness: query all archives (not just top k)
■ Accuracy: maximize coherence
![Page 88: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/88.jpg)
CONVEYING COHERENCE
How to scale to > 100 embedded mementos?
How to convey coherence & contributing archive?
![Page 89: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/89.jpg)
WHAT’S NEXT SUMMARY
■ Equality & Similarity
■ Significance of violation (major? minor?)
■ Policies & Heuristics
■ Conveying Coherence
![Page 90: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/90.jpg)
Progress Report Lulwah Alkwai
Presented to: Dr. Herbert Van de Sompel
![Page 91: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/91.jpg)
Previous Work
JCDL 2015 Paper: “How Well Are Arabic Websites Archived?” Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle
We won “Best Student Paper Award”
![Page 92: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/92.jpg)
English sports websites are better archived than Arabic ones
www.espn.go.com vs. www.kooora.com
![Page 93: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/93.jpg)
How do we classify Arabic websites?
• GeoIP only: News: al-watan.com
  • ccTLD: Not Arabic (.com)
  • GeoIP: Arabic country (Qatar)
• ccTLD only: E-Marketing: haraj.com.sa
  • ccTLD: Arabic (.sa)
  • GeoIP: Not an Arabic country (Ireland)
• Both: Educational: uoh.edu.sa
  • ccTLD: Arabic (.sa)
  • GeoIP: Arabic country (SA)
• Neither: News: alarabiya.net
  • ccTLD: Not Arabic (.net)
  • GeoIP: Not an Arabic country (US)
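The four quadrants above amount to two independent tests, one on the ccTLD and one on the GeoIP country. A minimal sketch, assuming the GeoIP country code has already been looked up elsewhere; the ccTLD and country sets below are partial, illustrative lists rather than the ones used in the paper:

```python
# Partial, illustrative sets (assumption); the study's full lists are not shown here.
ARABIC_CCTLDS = {"sa", "qa", "eg", "ae"}
ARABIC_COUNTRIES = {"SA", "QA", "EG", "AE"}


def classify(hostname, geoip_country):
    """Place a site in one of the four quadrants from the slide."""
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    has_cctld = tld in ARABIC_CCTLDS
    has_geoip = geoip_country.upper() in ARABIC_COUNTRIES
    if has_cctld and has_geoip:
        return "Both"
    if has_cctld:
        return "ccTLD only"
    if has_geoip:
        return "GeoIP only"
    return "Neither"


# The four example sites from the slide, with ISO country codes for the GeoIP result.
for host, country in [("alarabiya.net", "US"), ("haraj.com.sa", "IE"),
                      ("al-watan.com", "QA"), ("uoh.edu.sa", "SA")]:
    print(host, "->", classify(host, country))
```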
![Page 94: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/94.jpg)
Selecting seed URIs

| Name | Registered | Year | URI | Count |
|---|---|---|---|---|
| DMOZ | US | 1999 | Dmoz.org/world/arabic | 4,086 |
| Raddadi | Saudi Arabia | 2000 | Raddadi.com | 3,271 |
| Star28 | Lebanon | 2004 | Star28.com | 8,386 |
| Total | | | | 15,743 |

• 15,092 unique seed URIs
• 11,014 URIs that existed in the live web
![Page 95: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/95.jpg)
Language test intersection testing for Arabic language
[Figure: intersection diagram of the language tests. Labels recovered from the figure: ~41%, ~38%, ~36%, ~39%, 872, ~8%.]
![Page 96: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/96.jpg)
Crawling Arabic seed URIs
Total Arabic URIs Dataset = (7,976 + 292,670) = 300,646
![Page 97: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/97.jpg)
Findings
Our Arabic-language dataset was mostly not located in Arabic countries
• Only 14.84% had an Arabic ccTLD
• Only 10.53% had a GeoIP in an Arabic country
• Popular Western domains (e.g., cnn.com, wikipedia.org) appeared in the top 10
Arabic webpages are not particularly well archived or indexed
• 46% were not archived
• 31% were not indexed by Google
An Arabic webpage is more likely to be...
• indexed if it is present in a directory
• archived if it is present in DMOZ
• archived if it has neither Arabic GeoIP nor Arabic ccTLD
For right now, if you want your Arabic language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ
![Page 98: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/98.jpg)
Bibliotheca Alexandrina
Youssef Eldakar, Bibliotheca Alexandrina
• Since 2011, the BA crawls have focused on Egyptian content
• Seeds are manually selected
• Future plans are to cover content related to the Arab world
![Page 99: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/99.jpg)
Current Work: Replacements for missing images
Goal: find missing images through their context and discover a replacement for each image
Example:
![Page 100: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/100.jpg)
Motivation
• D-Lib Magazine, Jan 2005: “Transparent Format Migration of Preserved Web Content”, David S. H. Rosenthal, Thomas Lipkis, Thomas S. Robertson, and Seth Morabito
• The main idea was to migrate a file from a format that is no longer understood to a new format without changing the URI
• Can this be done for images with 404 responses?
• We can define a new response code and Location header, e.g. “210 Not Quite OK, But Close”
![Page 101: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/101.jpg)
Sample log query
0.36.125.141 web.archive.org - [01/Jan/2011:01:30:58 +0000] "GET http://web.archive.org/web/20110101013058/http://www.slaverymuseum.org/IraAtTable.jpeg HTTP/1.1" 404 2135 "http://web.archive.org/web/20030413174118/www.slaverymuseum.org/home.htm" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10" TCP_MISS:SOURCEHASH_PARENT/207.241.227.95 205
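A log entry like the one on this slide carries the three pieces needed to start a search for a replacement image: the requested URI-M, the 404 status, and the referring page. The sketch below pulls them out with a regular expression. It assumes the combined-log-style layout visible in the sample; the user-agent string is shortened for readability and the helper names are illustrative.

```python
import re

# The sample entry from the slide, with the user-agent string shortened.
LOG_LINE = (
    '0.36.125.141 web.archive.org - [01/Jan/2011:01:30:58 +0000] '
    '"GET http://web.archive.org/web/20110101013058/'
    'http://www.slaverymuseum.org/IraAtTable.jpeg HTTP/1.1" 404 2135 '
    '"http://web.archive.org/web/20030413174118/www.slaverymuseum.org/home.htm" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)"'
)

PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<uri>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)"'
)

IMAGE_SUFFIXES = (".jpeg", ".jpg", ".png", ".gif")


def parse(line):
    """Return request fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


def is_missing_image(entry):
    """True for 404 responses whose requested URI looks like an image."""
    return entry["status"] == "404" and entry["uri"].lower().endswith(IMAGE_SUFFIXES)


entry = parse(LOG_LINE)
if entry and is_missing_image(entry):
    print("missing:", entry["uri"])
    print("embedded in:", entry["referrer"])
```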
![Page 102: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/102.jpg)
Check full URI in the IA
> curl -I http://web.archive.org/web/20110101013058/http://www.slaverymuseum.org/IraAtTable.jpeg
HTTP/1.1 404 Not Found
Server: Tengine/2.1.0
Date: Tue, 04 Aug 2015 18:17:46 GMT
Content-Type: text/html;charset=utf-8
Connection: keep-alive
set-cookie: wayback_server=73; Domain=archive.org; Path=/; Expires=Thu, 03-Sep-15 18:17:45 GMT;
X-Archive-Wayback-Runtime-Error: ResourceNotInArchiveException: http://www.slaverymuseum.org/IraAtTable.jpeg was not found
X-Archive-Wayback-Perf: {"IndexLoad":144,"IndexQueryTotal":144,"RobotsFetchTotal":2,"RobotsRedis":1,"RobotsTotal":2,"Total":390}
X-Archive-Playback: 0
![Page 103: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/103.jpg)
URI requested
![Page 104: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/104.jpg)
Referring URI
![Page 105: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/105.jpg)
Check full URI in the live web
> curl -I http://www.slaverymuseum.org/IraAtTable.jpeg
HTTP/1.1 404 Not Found
Date: Tue, 04 Aug 2015 18:15:34 GMT
Server: Apache
Content-Type: text/html; charset=iso-8859-1
![Page 106: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/106.jpg)
Check Timetravel
![Page 107: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/107.jpg)
Check domain in the live web
> curl -I http://www.slaverymuseum.org
HTTP/1.1 301 Moved Permanantly
Date: Tue, 04 Aug 2015 18:26:41 GMT
Server: Apache
Location: https://vimeo.com/search?q=slaverymuseum.org
Content-Type: text/plain; charset=UTF-8
![Page 108: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/108.jpg)
Check image name in new page
• Not found
![Page 109: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/109.jpg)
Check leaf page for image name
• Not found
![Page 110: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/110.jpg)
Check domain in the IA
![Page 111: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/111.jpg)
Check search engine for image surrounding text
• Using the “src” and saving the “alt” in HTML (alternative information) as a backup, e.g.
  • Image src="IraAtTable.jpeg"
  • alt="Ira Hunter, Jr. and Oni Lasana"
<img border="0" src="IraAtTable.jpeg" width="120" height="97" align="top" alt="Ira Hunter, Jr. and Oni Lasana ">
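Collecting the `src` and `alt` values from the referring page is straightforward with Python's standard `html.parser` module. This is a small sketch of that step only; the class and function names are made up for the example, and how the values are then turned into search-engine queries is left open.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect src and alt attributes from every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            self.images.append({
                "src": attr.get("src"),
                # alt text is kept as the backup search term
                "alt": (attr.get("alt") or "").strip(),
            })


def image_search_terms(html):
    parser = ImgCollector()
    parser.feed(html)
    return parser.images


# The <img> tag shown on the slide.
TAG = ('<img border="0" src="IraAtTable.jpeg" width="120" height="97" '
       'align="top" alt="Ira Hunter, Jr. and Oni Lasana ">')
print(image_search_terms(TAG))
```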
![Page 112: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/112.jpg)
Searching Google for (IraAtTable.jpeg)
![Page 113: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/113.jpg)
Found same src name and parts of the surrounding text
![Page 114: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/114.jpg)
http://signhom.net/professionalshub/wp-content/uploads/sites/3/2013/11/IraAtTable.jpg
![Page 115: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/115.jpg)
New response code
> curl -I http://web.archive.org/web/20110101013058/http://www.slaverymuseum.org/IraAtTable.jpeg
210 Not Quite OK, But Close
Date: Wed, 05 Aug 2015 12:56:03 GMT
Location: http://signhom.net/professionalshub/wp-content/uploads/sites/3/2013/11/IraAtTable.jpg
![Page 116: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/116.jpg)
Summary of approaches
• Check full URI in the live web
• Check full URI in the IA
• Check full URI in Timetravel
• Check domain in the live web
• Check domain in the IA
• Check images in the redirected webpage
• Check leaf pages
• Check surrounding text in search engines
• Compare results of different search engines using image duplication, such as Google's large-scale analysis of images: http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
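The approaches listed above read naturally as an ordered fallback chain: try each check in turn and stop at the first one that yields a candidate. A minimal sketch of that control flow follows. The check functions are stubs standing in for real lookups (live web, IA, search engine), and the stub results simply mirror the walkthrough on the preceding slides; none of this is the author's implementation.

```python
def find_replacement(image_uri, checks):
    """Run (name, check) pairs in order and stop at the first hit.

    Each check takes the image URI and returns a candidate replacement
    URI, or None when it finds nothing.
    """
    tried = []
    for name, check in checks:
        tried.append(name)
        candidate = check(image_uri)
        if candidate:
            return {"found_by": name, "replacement": candidate, "tried": tried}
    return {"found_by": None, "replacement": None, "tried": tried}


# Stub checks (hypothetical): real versions would issue HTTP requests.
def live_web(uri):
    return None  # the walkthrough got a 404 here


def internet_archive(uri):
    return None  # the walkthrough got a 404 here as well


def search_engine(uri):
    return ("http://signhom.net/professionalshub/wp-content/"
            "uploads/sites/3/2013/11/IraAtTable.jpg")


CHECKS = [
    ("live web", live_web),
    ("IA", internet_archive),
    ("search engine", search_engine),
]

result = find_replacement("http://www.slaverymuseum.org/IraAtTable.jpeg", CHECKS)
print(result["found_by"], result["replacement"])
```

Keeping the checks in a list makes the order explicit and lets cheaper lookups run before more expensive ones.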
![Page 117: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/117.jpg)
Other ideas: Image de-duplication
• JCDL 2015: “Identifying Duplicate and Contradictory Information in Wikipedia”, by Sarah Weissman, Samet Ayhan, Joshua Bradley, Jimmy Lin
• Can we do the same for the archives by detecting and removing duplicate images?
  • How many duplicate images?
  • Which version should be kept?
![Page 118: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/118.jpg)
Profiling Web Archives
Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia
Herbert Van de Sompel, Lyudmila L. Balakireva, and Harihar Shankar
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
![Page 119: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/119.jpg)
Memento Aggregator
![Page 120: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/120.jpg)
Memento Aggregator
![Page 121: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/121.jpg)
Memento Aggregator
![Page 122: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/122.jpg)
Memento Aggregator
![Page 123: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/123.jpg)
Memento Aggregator
![Page 124: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/124.jpg)
Memento Aggregator
![Page 125: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/125.jpg)
Long Tail of Archives
![Page 126: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/126.jpg)
Long Tail of Archives
• […]B+ web pages at IA do not cover everything
• Top three archives after IA produce full TimeMap […] of the time (AlSum et al., TPDL […])
• Targeted crawls
• Special focus archives
• Restricted resources
• Private archives
![Page 127: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/127.jpg)
Archive Profile
• High-level summary of an archive
• Predicts presence of mementos of a URI-R in an archive
• Provides various statistics about the holdings
• Small in size
• Publicly available
• Easy to update and partially patch
• Useful for Memento query routing and other things
![Page 128: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/128.jpg)
Available Profiling Resources
• Client request
• Archive response
• Archive index (CDX files)
![Page 129: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/129.jpg)
A Client Request
![Page 130: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/130.jpg)
An Archive Response
![Page 131: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/131.jpg)
A CDX Snippet
![Page 132: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/132.jpg)
Profiling Strategies
• Complete URI-R Profiling (URI-R = Profile Key)
  ◦ bbc.co.uk/images/logo.png?w=[…]
  ◦ cnn.com/[…]?id=[…]
• TLD-only Profiling (TLD = Profile Key)
  ◦ com)/
  ◦ uk)/
• Middle Ground
  ◦ uk,co)/
  ◦ uk,co,bbc)/images
  ◦ uk,co,bbc)/[…]
  ◦ com,cnn)/[…]ar
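The "middle ground" keys can be thought of as truncations of a reversed-host (SURT-style) form of the URI: keep only the first few host labels and, optionally, the first few path segments. The sketch below illustrates that idea under several stated simplifications: it drops the query string, strips a leading `www` label, and ignores the path whenever the host is truncated. The function name and parameters are hypothetical and are not taken from the archive_profiler code.

```python
from urllib.parse import urlsplit


def uri_key(uri, host_segments=None, path_segments=None):
    """Build a SURT-style profile key with limited host/path depth.

    host_segments / path_segments of None mean "keep everything".
    """
    parts = urlsplit(uri if "//" in uri else "//" + uri)
    labels = (parts.hostname or "").lower().split(".")
    if labels and labels[0] == "www":  # assumed canonicalization step
        labels = labels[1:]
    labels.reverse()  # bbc.co.uk -> uk, co, bbc

    truncated_host = host_segments is not None and host_segments < len(labels)
    if host_segments is not None:
        labels = labels[:host_segments]

    path = [seg for seg in parts.path.split("/") if seg]
    if truncated_host:
        path = []  # a path under a partial host would be meaningless
    elif path_segments is not None:
        path = path[:path_segments]

    return ",".join(labels) + ")/" + "/".join(path)


uri = "http://www.bbc.co.uk/images/logo.png?w=10"
print(uri_key(uri, host_segments=1, path_segments=0))  # TLD-only style key
print(uri_key(uri, host_segments=3, path_segments=1))  # middle-ground key
print(uri_key(uri))                                    # full host and path
```

Shallower keys give smaller profiles at the price of coarser routing decisions, which is the cost/precision tradeoff the later slides evaluate.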
![Page 133: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/133.jpg)
Frequency Measurements
![Page 134: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/134.jpg)
CDXJ Serialization
![Page 135: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/135.jpg)
URI-Key Generation
![Page 136: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/136.jpg)
Profile Merging
Base profile
New profile
Merged profile
![Page 137: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/137.jpg)
Dataset
• Three archives
• Four sample query sets
• […] profiles for each archive and sample set
![Page 138: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/138.jpg)
Archives

| Archive | URI-Rs | URI-Ms | Size |
|---|---|---|---|
| Archive-It | […]B | […]B | […]TB |
| UKWA | […]B | […]B | […]TB |
| Stanford | […]M | […]M | […]GB |
![Page 139: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/139.jpg)
Sample Query Sets

| Sample | In Archive-It | In UKWA | In Stanford |
|---|---|---|---|
| DMOZ | […] | […] | […] |
| MementoProxy | […] | […] | […] |
| IAWayback | […] | […] | […] |
| UKWayback | […] | […] | […] |

Sample Size: […]M URIs Each
![Page 140: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/140.jpg)
Evaluation
• Relate CDX Size, URI-M, URI-R, and URI-Key
• Analyze profile growth
• Estimate Relative Cost
• Evaluate Routing Precision vs. Relative Cost
![Page 141: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/141.jpg)
CDX Size vs URI-M (UKWA, […] Years)
Alpha: […] bytes per CDX line
![Page 142: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/142.jpg)
URI-M vs URI-R (UKWA, […] Years)
Gamma: […], K: […], Beta: […]
![Page 143: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/143.jpg)
Space Cost (UKWA, […] Years)
Phi: […]
![Page 144: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/144.jpg)
Time Cost (UKWA, […] Years)
Tau: […]
CDX: […] GB
URI-Ms: […]M
URI-Rs: […]M
Time: […] hours
![Page 145: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/145.jpg)
Resource Requirement
![Page 146: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/146.jpg)
Archive-It
![Page 147: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/147.jpg)
UKWA
![Page 148: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/148.jpg)
Stanford
![Page 149: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/149.jpg)
Cost vs Precision

| Group | Cost | Precision |
|---|---|---|
| G[…] (H[…]P[…] TLD) | Bound by […] of TLDs | […] |
| G[…] (H[…]P[…], DDom, DSub, DPth, DQry) | […] | […] G[…] |
| G[…] (DIni) | […] G[…] | […] G[…] |
| G[…] (HxP[…]) | […] G[…] | […] G[…] |
| G[…] (Higher HmPn) | […] | Not Explored |
| G[…] (URIR) | […] | […] |
![Page 150: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/150.jpg)
Future Work
• Generating sample URI sets
• Profiling via sampling
• Language profiles
• Evaluation of combination profiles such as URI-Key along with Datetime
• Profiles for usage other than Memento routing, such as site classification based profiles (e.g., news, wiki, social media, blog etc.)
![Page 151: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/151.jpg)
Conclusions
• Generated profiles with different policies for two archives
• Examined cost/precision tradeoffs of various policies
• Related CDX Size, URI-M, URI-R, and URI-Key
• Gained up to […] routing precision with […] relative cost without any false negatives
• Code @ GitHub: oduwsdl/archive_profiler
![Page 152: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/152.jpg)
What has Justin been up to, lately?
Justin F. Brunelle
Presentation for Herbert Van de Sompel
08/06/2015
![Page 153: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/153.jpg)
A simpler time...
![Page 154: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/154.jpg)
Mass hysteria. Human sacrifices. Dogs and cats living together.
<iframe><script>...</script></iframe>
![Page 155: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/155.jpg)
Missing resources (bad) and Temporal violations (worse)
http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
![Page 156: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/156.jpg)
http://en.wikipedia.org/wiki/Main_Page January 18th, 2012
![Page 157: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/157.jpg)
http://web.archive.org/web/20120118110520/http://en.wikipedia.org/wiki/Main_Page:
January 18th, 2012
![Page 158: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/158.jpg)
Not all tools can crawl equally
Live Resource PhantomJS Crawled
Heritrix Crawled, Wayback replayed
![Page 159: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/159.jpg)
Current Workflow
• Dereference URI-Rs
• Archive representation
• Extract embedded URI-Rs
• Repeat
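The current workflow is the classic crawl loop: take a URI-R from the frontier, dereference it, archive the representation, extract its embedded URI-Rs, and repeat. A minimal sketch with the network and parsing steps injected as functions so it runs without any real crawler; the names and the `limit` parameter are illustrative only.

```python
from collections import deque


def crawl(seeds, dereference, extract_links, archive, limit=100):
    """Breadth-first crawl: dereference, archive, extract, repeat."""
    frontier = deque(seeds)
    seen = set(seeds)
    while frontier and len(archive) < limit:
        uri = frontier.popleft()
        representation = dereference(uri)
        if representation is None:  # could not be dereferenced
            continue
        archive[uri] = representation
        for link in extract_links(representation):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return archive


# A toy "web": each URI maps to the list of URIs embedded in it.
FAKE_WEB = {"a": ["b", "c"], "b": ["a"], "c": []}

archived = crawl(["a"], FAKE_WEB.get, lambda rep: rep, {})
print(list(archived))
```

Anything a page loads only after running JavaScript never reaches `extract_links` in this loop, which is the gap the proposed workflow targets.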
![Page 160: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/160.jpg)
Proposed Workflow
![Page 161: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/161.jpg)
<script> tags alone are not indicative of a deferred representation. JavaScript can be played back in the archives!
Current workflow not suitable for deferred representations
Use PhantomJS to run JavaScript, interact with the representation
Two-tiered crawling approach to optimize performance
![Page 162: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/162.jpg)
More URI-Rs in the crawl frontier
Runs more slowly but more deeply
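A two-tiered approach could be sketched as a router that sends only deferred representations to the slower browser-based crawler. Every name below is an assumption for illustration: `is_deferred` stands in for a classifier, `fast_crawl` for a Heritrix-style crawler, and `browser_crawl` for a PhantomJS-style crawler that runs JavaScript.

```python
def two_tier_crawl(uri_rs, is_deferred, fast_crawl, browser_crawl):
    """Route each URI-R to the cheapest crawler expected to capture it.

    Each crawl callable (hypothetical) returns the set of URI-Rs it
    discovered while capturing the given resource.
    """
    frontier = set()
    routed = {"fast": [], "browser": []}
    for uri_r in uri_rs:
        if is_deferred(uri_r):
            # Slower tier: run JavaScript, find more embedded URI-Rs.
            routed["browser"].append(uri_r)
            frontier |= browser_crawl(uri_r)
        else:
            # Faster tier: parse the markup only.
            routed["fast"].append(uri_r)
            frontier |= fast_crawl(uri_r)
    return routed, frontier
```

The trade-off mirrors the slide: the browser tier adds URI-Rs to the frontier at the cost of run time, so it is reserved for resources that need it.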
![Page 163: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/163.jpg)
Run-time & Frontier size PhantomJS vs. Heritrix
To appear: iPres2015
![Page 164: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/164.jpg)
Constructed a classifier for Deferred Representations
![Page 165: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/165.jpg)
Performance metrics of a two-tiered crawling approach
![Page 166: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/166.jpg)
The classifier helps crawl deferred representations most efficiently
![Page 167: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/167.jpg)
Current & Future Work
Using PhantomJS to execute actions on the client
– Pushing buttons
– Selecting drop-downs
– Archiving resulting representation changes
Represent representation state in WARCs
– Graph structure of embedded resources
– Replay in the Wayback Machine
http://ws-dl.blogspot.com/2015/06/2015-06-26-phantomjsvisualevent-or.html
![Page 168: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/168.jpg)
Presented by Mat Kelly for Herbert Van de Sompel
Web Science and Digital Libraries Research Lab
Old Dominion University, Norfolk, VA
August 6, 2015
![Page 169: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/169.jpg)
• Software as a support vehicle
• Issues being investigated for PhD research topic
• Sample access patterns mitigated by new Memento-related entities
HVDS Presentation 2
![Page 170: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/170.jpg)
Building Software as a PhD Researcher
Software as a Support Vehicle
![Page 171: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/171.jpg)
• Purpose: capture what user sees into WARC – instead of delegation-by-URI
• Barriers:
– Restrictive browser extension API (Evolved/time)
– Wheel inventing (nothing for WARCs in JS)
• Perks:
– Seeded private web archiving research
– Exposed hard-to-archive content
Website: http://warcreate.com
Blog: http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html
![Page 172: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/172.jpg)
• “Glue” between institutional tools – hard to configure and use
• Native binaries – difficult to maintain but novel
• Further facilitated private web archiving interest
Website: http://matkelly.com/wail
Blog: http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html
![Page 173: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/173.jpg)
• Integrates live + archived web experience
• Become familiar with Memento dynamics & usage patterns
• Provide eventual hook into new entities
Website: http://matkelly.com/mink
Blog: http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html
![Page 174: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/174.jpg)
• Given same input (URI), tools produce varying output
• Experiment to measure variance
• Identified hard-to-archive resources
• Highlighted cutting edge browser-crawler
Website: http://acid.matkelly.com
Blog: http://ws-dl.blogspot.com/2014/07/2014-07-14-archival-acid-test.html
![Page 175: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/175.jpg)
Current Research
![Page 176: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/176.jpg)
[Diagram labels: private archive (four instances); other (two instances)]
![Page 177: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/177.jpg)
[Diagram labels: private archive (four instances); other (two instances); TimeMap]
![Page 178: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/178.jpg)
t = k ≠ t = k-1
![Page 179: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/179.jpg)
![Page 180: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/180.jpg)
90 DAYS AT A TIME
ONLY BACK TO ONE YEAR!
![Page 181: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/181.jpg)
[Timeline labels: 180 days ago; 1 year ago; 2 years ago; …; 10 years ago; TimeMap]
![Page 182: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/182.jpg)
[Diagram label: private archive]
![Page 183: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/183.jpg)
Facebook.com replay
What is expected vs. what the tools captured
![Page 184: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/184.jpg)
Internet Archive: public, aggregated
Archive.today: public, aggregated
Foo Archives: public, non-aggregated
My web archive: private, non-aggregated
time → Archives capturing my homepage
Changes to my homepage
![Page 185: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/185.jpg)
![Page 186: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/186.jpg)
Sample Access Patterns
![Page 187: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/187.jpg)
OR TimeMap
![Page 188: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/188.jpg)
• More mementos from a superset of sources
TimeMap
![Page 189: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/189.jpg)
• Patterns 1 and 2 are status quo – provided by framework
• Querying web archives currently only considers public web content
– URI for lookup
• Framework introduces 2 new entities
– Memento Meta Aggregator (MMA)
– Private Web Archive Adapter (PWAA)
![Page 190: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/190.jpg)
• Functional superset of the Memento Aggregator (MA)
• Can act as intermediary client to relay MA results to ultimate user
• Allows just-in-time (JIT) inclusion of archives – as specified at query time
• Set of archives aggregated can be dynamic – e.g., Results must not include IA
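The query-time behavior described in these bullets can be sketched as follows. This is an illustration under stated assumptions, not the MMA's actual interface: each archive is modeled as a hypothetical callable returning `(memento_datetime, uri_m)` tuples, and the `include`/`exclude` parameter names are invented for the example.

```python
def aggregate_timemap(uri_r, default_archives, include=None, exclude=()):
    """Merge mementos for uri_r from a set of archives chosen at query time.

    default_archives: dict of name -> lookup callable (hypothetical model).
    include: extra archives added just in time for this query only.
    exclude: names of archives to leave out of this query.
    """
    archives = dict(default_archives)      # copy so the default set is untouched
    archives.update(include or {})         # JIT inclusion
    for name in exclude:
        archives.pop(name, None)           # e.g. results must not include IA
    mementos = []
    for lookup in archives.values():
        mementos.extend(lookup(uri_r))
    return sorted(mementos)                # order by memento datetime
```

With three public archives holding 100, 30, and 10 captures, the default query yields 140 mementos; including a further archive with 15 captures yields 155.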
![Page 191: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/191.jpg)
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; Various public web archives; My web archives]
![Page 192: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/192.jpg)
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; 100; 30; 10]
![Page 193: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/193.jpg)
![Page 194: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/194.jpg)
[Diagram labels: MY CAPTURES (NOT AGGREGATED); MY BANK CAPTURES (NOT AGGREGATED); 100; 30; 10; 140]
![Page 195: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/195.jpg)
![Page 196: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/196.jpg)
![Page 197: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/197.jpg)
Access via the Meta Aggregator
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; 100; 30; 10; 140; 140]
![Page 198: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/198.jpg)
Access via the Meta Aggregator
… allows our archives to be included
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; 100; 30; 10; 15; 140; 155]
![Page 199: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/199.jpg)
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; 100; 30; 10; 15; 140; 155; 155; 155]
![Page 200: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/200.jpg)
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; …; Bob’s public CAPTURES; The organization’s public CAPTURES 1; The organization’s public CAPTURES 2; archives A, B, C, D; contains A B C D; contains B C D; contains C D; counts 10, 5, 15, 15, 20, 35, 35, 15, 50, 50]
![Page 201: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/201.jpg)
• Allow dynamic and JIT set of archives
• Superset can be recursively constructed
• Sets can be shared
My public captures can be integrated with public web archives’
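Recursive construction of a superset can be sketched with aggregators that contain archives or other aggregators. The class names and the assignment of capture counts to archives A through D are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Archive:
    """A leaf source with a fixed number of captures (illustrative)."""
    name: str
    captures: int

    def leaves(self):
        return {self.name: self.captures}


@dataclass
class Aggregator:
    """A set of sources; members may be archives or other aggregators."""
    members: list

    def leaves(self):
        merged = {}
        for member in self.members:
            # Keyed by name, so an archive reachable twice is counted once.
            merged.update(member.leaves())
        return merged

    def total(self):
        return sum(self.leaves().values())


# Assumed counts: C=10, D=5, B=20, A=15.
a, b, c, d = Archive("A", 15), Archive("B", 20), Archive("C", 10), Archive("D", 5)
cd = Aggregator([c, d])        # contains C D
bcd = Aggregator([b, cd])      # contains B C D
abcd = Aggregator([a, bcd])    # contains A B C D
```

Since an aggregator is just another member, a shared set can be reused inside someone else's aggregator without copying its contents.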
![Page 202: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/202.jpg)
![Page 203: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/203.jpg)
• Regulates access to Private Web Archives (PWAs)
• Acts as token authorizer
• With correct credentials, relays results as if querying the PWA directly
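The adapter's role as token authorizer can be sketched as a two-step exchange: a key is traded for a token, and the token accompanies each memento request. The class and method names are hypothetical, and token generation here is a placeholder rather than a proposed scheme.

```python
import hmac
import secrets


class PrivateWebArchiveAdapter:
    """Illustrative gatekeeper in front of one private web archive."""

    def __init__(self, valid_key, mementos_by_uri):
        self._valid_key = valid_key
        self._mementos = mementos_by_uri   # stands in for the PWA itself
        self._tokens = set()

    def get_token(self, key):
        """Return a fresh token for a correct key, else None."""
        if not hmac.compare_digest(key, self._valid_key):
            return None
        token = secrets.token_hex(4)
        self._tokens.add(token)
        return token

    def get_mementos(self, uri_r, token):
        """Relay the PWA's results for a valid token; otherwise 0 mementos."""
        if token not in self._tokens:
            return []
        return list(self._mementos.get(uri_r, []))
```

A caller holding a valid token sees the same results as if it had queried the PWA directly, while a missing or invalid token yields an empty result.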
![Page 204: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/204.jpg)
GET TOKEN for PWA
Key: abcd1234
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; 100; 30; 10; 3 captures; 10,000 captures]
![Page 205: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/205.jpg)
![Page 206: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/206.jpg)
ACCESS OK
Token: 4f33c64
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; 100; 30; 10; 3 captures; 10,000 captures]
![Page 207: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/207.jpg)
GET mementos for URI
Token: 4f33c64
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; 100; 30; 10; 3 captures; 10,000 captures]
![Page 208: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/208.jpg)
![Page 209: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/209.jpg)
Token: 4f33c64
OK
GET mementos for URI
GET mementos for URI
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; 100; 30; 10; 3 captures; 10,000 captures]
![Page 210: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/210.jpg)
Token: 4f33c64 OK
Returning mementos
Return mementos for URI
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; 100; 30; 10; 3 captures; 10,000 captures]
![Page 211: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/211.jpg)
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; TimeMap (three instances); 100; 30; 10; 3 captures; 10,000 captures; 140 captures; 10,000; 10,143]
![Page 212: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/212.jpg)
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; TimeMap (three instances); 100; 30; 10; 3 captures; 10,000 captures; 140 captures, 3 captures, 10,000 captures; 10,143]
![Page 213: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/213.jpg)
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; TimeMap; 100; 30; 10; 3 captures; 10,000 captures; 10,143 captures]
![Page 214: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/214.jpg)
... , <http://web.archive.org/web/20150228155703/https://facebook.com/>;rel="memento";
datetime="Sat, 28 Feb 2015 15:57:03 GMT"
, <http://web.archive.org/web/20150228163939/http://www.facebook.com/>;rel="memento";
datetime="Sat, 28 Feb 2015 16:39:39 GMT"
, <http://web.archive.org/web/20150303162841/https://www.facebook.com/>;rel="memento";
datetime="Tue, 03 Mar 2015 16:28:41 GMT"
, <http://users2machine.local/web/20150305000101/https://www.facebook.com/>;rel="memento";
datetime="Thu, 05 Mar 2015 00:01:00 GMT"; key="e395935019ee467c797034ee410cc91e"
, <//wayback.archive-it.org/all/20150305215922/https://facebook.com/>;rel="memento";
datetime="Thu, 05 Mar 2015 21:59:22 GMT"
, <http://previouslyUnaggregated.org/web/20150306123457/https://www.facebook.com/>;rel="memento";
datetime="Fri, 06 Mar 2015 12:34:57 GMT"
, <http://web.archive.org/web/20150310140721/https://www.facebook.com/>;rel="memento";
datetime="Tue, 10 Mar 2015 14:07:21 GMT" ...
TimeMap
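A merged TimeMap in this link format can be read with a small parser. The sketch below is an assumption-laden illustration: it uses regular expressions rather than a full link-format parser, and it treats the `key` attribute shown on the private entry as an opaque, optional extension.

```python
import re
from email.utils import parsedate_to_datetime

# One entry: <URI>; followed by attributes. Commas inside quotes are kept.
ENTRY = re.compile(r'<([^>]*)>\s*;\s*((?:[^,"]|"[^"]*")*)')
PARAM = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_timemap(text):
    """Return the memento entries of a link-format TimeMap as dicts."""
    entries = []
    for uri, params in ENTRY.findall(text):
        attrs = dict(PARAM.findall(params))
        if "memento" not in attrs.get("rel", "").split():
            continue
        entries.append({
            "uri": uri,
            "datetime": parsedate_to_datetime(attrs["datetime"]),
            "key": attrs.get("key"),       # None for public entries
        })
    return entries


def private_entries(entries):
    """Entries carrying a key, i.e. those marked as private in this sketch."""
    return [entry for entry in entries if entry["key"] is not None]
```

Separating keyed entries this way lets a client tell which mementos came from a private archive without inspecting hostnames.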
![Page 215: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/215.jpg)
MY PRIVATE FACEBOOK CAPTURES
![Page 216: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/216.jpg)
MY PRIVATE FACEBOOK CAPTURES
NOT RFC 5988 COMPLIANT!
![Page 217: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/217.jpg)
MY PUBLIC FACEBOOK CAPTURES
![Page 218: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/218.jpg)
GET mementos for URI
Token: 4f33c64
GET mementos for URI
Token: c5463b4
GET TOKEN for PWA
Key: 2265eef3
No/invalid token returned
Access denied or 0 mementos
[Diagram labels: MY CAPTURES; MY BANK CAPTURES; 3 captures; 10,000 captures]
![Page 219: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/219.jpg)
GET TOKENs for PWAs
Key: abcd1234, Archive: My
Key: cab45cbf, Archive: Linda
Key: b0b01b, Archive: Bob
[Diagram labels: MY BANK CAPTURES; Linda’s Private Captures; Bob’s Private Captures; 3 captures; 5 captures; 10 captures]
![Page 220: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/220.jpg)
Access OK
Token: 7790ca
Access OK
Token: b0b01b
ACCESS DENIED
[Diagram labels: MY BANK CAPTURES; Linda’s Private Captures; Bob’s Private Captures; 3 captures; 5 captures; 10 captures]
![Page 221: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/221.jpg)
GET mementos for URI
Token: 7790ca, Archive: My
Token: null, Archive: Linda
Token: b0b01b, Archive: Bob
[Diagram labels: MY BANK CAPTURES; Linda’s Private Captures; Bob’s Private Captures; 3 captures; 5 captures; 10 captures; returned 3, 10, ø; total 13]
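The multi-archive case can be worked through in a few lines: archives for which no token was granted contribute nothing, and the rest are summed. The function name and the `lookup(uri_r, token)` signature are assumptions for illustration.

```python
def aggregate_private(uri_r, archives, tokens):
    """Collect mementos from private archives that granted a token.

    archives: dict of name -> lookup(uri_r, token) callable (hypothetical).
    tokens:   dict of name -> token string, or None where access was denied.
    """
    per_archive = {}
    for name, lookup in archives.items():
        token = tokens.get(name)
        # A null token means access was denied, so skip the request.
        per_archive[name] = lookup(uri_r, token) if token is not None else []
    total = sum(len(mementos) for mementos in per_archive.values())
    return per_archive, total


def fixed_archive(count):
    """Stand-in archive that returns `count` placeholder mementos."""
    return lambda uri_r, token: [f"{uri_r}#capture-{i}" for i in range(count)]


archives = {"My": fixed_archive(3), "Linda": fixed_archive(5), "Bob": fixed_archive(10)}
tokens = {"My": "7790ca", "Linda": None, "Bob": "b0b01b"}
per_archive, total = aggregate_private("http://example.com/", archives, tokens)
```

With Linda's archive denied, the result is 3 + 10 = 13 mementos, matching the counts on the slide.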
![Page 222: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/222.jpg)
• Preserve Private Web Content
• Simulate & Quickly Deploy Private Web Archives
• Interface with New Entities Using Memento
New Software:
![Page 223: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/223.jpg)
• Background research on state-of-the-art
• Exploring use cases
– Existing, anticipated, and fabricated
• Resisting desire to code
![Page 224: @WebSciDL PhD Student Project Reviews August 5&6, 2015](https://reader030.vdocuments.site/reader030/viewer/2022032514/55d6dea6bb61eb49538b4787/html5/thumbnails/224.jpg)
• Why?
– No means exists to integrate private and public web archives.
• How to Evaluate?
– Does this framework fit real world needs? Scalable?
• When will I know I am done?
– Any public/private web archive* can be integrated.
* [image]-compliant