web archiving challenges and opportunities
DESCRIPTION
This is my presentation for a job interview as a web archiving engineer at Stanford University Libraries on Oct 25.
TRANSCRIPT
![Page 1: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/1.jpg)
WEB ARCHIVING CHALLENGES & OPPORTUNITIES
PRESENTATION FOR WEB ARCHIVING ENGINEERING POSITION
Ahmed AlSum
PhD Candidate
Old Dominion University
![Page 2: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/2.jpg)
Outline
• Engineering Experience
  • IBM
  • Old Dominion University
  • Internet Archive
• Web Archiving Challenges & Opportunities
  • Selection
  • Harvesting
  • Storage
  • Access
  • Community
• Conclusions
![Page 3: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/3.jpg)
Cairo, Egypt
2006 - 2009
![Page 4: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/4.jpg)
CCSP Project
• An internal IBM support portal that provides client-facing audiences a by-client, holistic view of client situations
• Technologies: WebSphere Portal, DB2, deployed on zLinux machines
![Page 5: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/5.jpg)
Responsibilities
• Software Engineer
  • Enterprise applications with J2EE platform technologies for the front end (Servlets, JSP, Portlet APIs) and back-end tasks based on EJB
  • Front-end components based on Web 2.0 technologies (AJAX based on Dojo 1.0, and JavaScript)
  • Lotus Sametime (plugin and bot development)
• Software engineering team leader
  • Support project quality activities
  • Lead code review and static analysis activities
![Page 6: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/6.jpg)
Responsibilities
• Administrator
  • Deploying Portal solutions on WebSphere Portal
  • WebSphere Portal administration for standalone and clustered environments
  • Administration on Linux and Windows OS
  • DB2 server administration for single instance and multiple instances with HADR support
• Customer support team lead
  • Leading customer support activities
![Page 7: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/7.jpg)
Certifications
![Page 8: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/8.jpg)
Sharing IBM Internal Solutions with Broader Community
![Page 9: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/9.jpg)
Norfolk, VA USA
2009 - 2013
![Page 10: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/10.jpg)
Memento
• Memento is an HTTP extension to integrate the Past and the Current Web
I. Jacobs and N. Walsh. Architecture of the World Wide Web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/
Figure labels: Now, T1, T2, T3
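The idea of negotiating in the datetime dimension can be sketched as follows. This is a minimal, offline illustration and not the Memento protocol implementation: the `mementos` list, the `archive.example` URIs, and the function names are hypothetical. It formats a requested time the way an `Accept-Datetime` request header carries it (an HTTP-date) and picks the capture closest to that time.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(dt):
    """Format a datetime as the HTTP-date value carried by Accept-Datetime."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


def closest_memento(mementos, accept_datetime):
    """Return the (capture_time, uri) pair nearest to the requested time.

    `mementos` is a hypothetical in-memory list; a real archive would
    consult its own index instead.
    """
    target = parsedate_to_datetime(accept_datetime)
    return min(mementos, key=lambda m: abs(m[0] - target))


# Hypothetical captures of one original resource at three past times.
mementos = [
    (datetime(2009, 5, 1, tzinfo=timezone.utc), "http://archive.example/t1/http://example.com/"),
    (datetime(2010, 2, 12, tzinfo=timezone.utc), "http://archive.example/t2/http://example.com/"),
    (datetime(2012, 8, 30, tzinfo=timezone.utc), "http://archive.example/t3/http://example.com/"),
]

header = accept_datetime_header(datetime(2010, 2, 20, tzinfo=timezone.utc))
chosen = closest_memento(mementos, header)
```

Choosing the nearest capture is only one possible policy; an archive could equally prefer the closest capture before the requested time.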
![Page 11: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/11.jpg)
Memento
• Developer and administrator for Memento aggregator and proxies
![Page 12: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/12.jpg)
Memento Clients
• Memento is currently an Internet-Draft (I-D) and is expected to move to RFC status soon.
![Page 13: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/13.jpg)
San Francisco, CA USA
2012
![Page 14: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/14.jpg)
WAT Extraction
• Web Archive Transformation (WAT) is a specification for structuring metadata generated by Web crawls
• Technologies:
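To make the kind of metadata a WAT record carries more concrete, here is a small sketch that reads one JSON record and lists its outlinks. The field names (`Envelope`, `WARC-Header-Metadata`, `Payload-Metadata`, `HTML-Metadata`, `Links`) are assumptions modeled on commonly seen WAT files; the function name is hypothetical and this is not the extraction tool described in the talk.

```python
import json


def extract_links(wat_record_json):
    """Return (target_uri, [(link_url, anchor_text), ...]) from one WAT-style record.

    The nested key layout below is an assumption about the record structure.
    """
    envelope = json.loads(wat_record_json)["Envelope"]
    target = envelope["WARC-Header-Metadata"]["WARC-Target-URI"]
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    links = [(link.get("url"), link.get("text", "")) for link in html_meta.get("Links", [])]
    return target, links
```

Because the metadata is already structured, link extraction like this does not need to re-parse the archived HTML.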
![Page 15: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/15.jpg)
WEB ARCHIVING
Challenges and Opportunities
![Page 16: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/16.jpg)
Web Archive Life Cycle
Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.
![Page 17: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/17.jpg)
Selection
• Decide what to capture
Everything, any domain
National domains
Delegate selection to partners
Users’ favorites
• We studied what is already captured
![Page 18: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/18.jpg)
How Much Of The Web Is Archived?
S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson
In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, Ottawa, Canada 2011
See also: http://arxiv.org/abs/1212.6177
![Page 19: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/19.jpg)
Archive categories
We have 3 categories of archives
• Internet Archive (classic interface)
• Search engine
• Other archives
Selection
UK
US
Public Archives, ca. Late 2010 / Early 2011
![Page 20: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/20.jpg)
1000 URIs Ordered by First Observation Date
Selection
See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
![Page 21: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/21.jpg)
Memento Distribution, ordered by the first observation date
![Page 22: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/22.jpg)
How Much of the Web is Archived?
It Depends on Which Web…
Selection
| Including SE cache | Excluding SE cache |
| --- | --- |
| 90% | 79% |
| 97% | 68% |
| 88% | 19% |
| 35% | 16% |

Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives
2013: 95%, 92%, 23%, 26%
![Page 23: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/23.jpg)
Profiling Web Archive Coverage For Top-level Domain And Content Language
A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel
In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013
See also: http://arxiv.org/abs/1309.4008
![Page 24: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/24.jpg)
Where is it archived?
Selection
IA: Internet Archive; CAN: Library and Archives Canada; PO: Portuguese Web Archive; CZ: Archive of the Czech Web
LoC: Library of Congress; BL: British Library; CAT: Web Archive of Catalonia; TW: National Taiwan University
IC: Icelandic Web Archive; UK: UK National Library; CR: Croatian Web Archive; AIT: Archive-It
![Page 25: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/25.jpg)
Language Coverage
Selection
![Page 26: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/26.jpg)
Growth Rate
Selection
Borrowed Portuguese material from IA
Stopped archiving since 2008
Steady growth
Stopped getting new URIs, but still crawling
![Page 27: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/27.jpg)
Selection Research Output
• Some portions of the web, such as India and Africa, are not well archived.
• Profiling helps us with Memento query routing.
• IIPC proposal with Herbert Van de Sompel (LANL) and David Rosenthal (SUL).
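How an archive profile could drive query routing can be sketched as below. The profile contents are made-up placeholders, not results from the profiling study, and `PROFILES` and `route` are hypothetical names: each archive is mapped to the top-level domains it is assumed to hold, and a lookup is sent only to archives whose profile matches the requested URI.

```python
from urllib.parse import urlsplit

# Hypothetical profiles: archive code -> TLDs it is assumed to cover.
# "*" marks an archive that is queried for every URI.
PROFILES = {
    "IA": {"*"},
    "PO": {"pt"},
    "IC": {"is"},
    "UK": {"uk"},
}


def route(uri, profiles=PROFILES):
    """Return the archives worth querying for this URI, based on its TLD."""
    host = urlsplit(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return [name for name, tlds in profiles.items() if "*" in tlds or tld in tlds]
```

With such a table, an aggregator can skip archives that are unlikely to hold the URI instead of polling all of them on every request.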
Selection
![Page 28: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/28.jpg)
Selection at SUL
• Focus on the missing parts of the Web
• Twitter - Crowdsource:
  • UK Web Archive: Twittervane
  • Internet Memory: Collect URIs from Twitter APIs
  • VA Tech: CTRNET project
• Stanford Community
• World News collection: 10 news websites from each country
• Tools:
Selection
![Page 29: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/29.jpg)
Web Archive Life Cycle
Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.
![Page 30: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/30.jpg)
Harvesting
• Services
  • Archive-It
  • WAS @ CDLib
• Dedicated servers
• New tools
See also: http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html
![Page 31: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/31.jpg)
Special Harvesting Techniques
• Borrow old materials from other web archives
• Ex.: Stanford WebBase Project*
  • 260 TB
  • 7 billion webpages
Harvesting
*http://www-diglib.stanford.edu/~testbed/doc2/WebBase/
![Page 32: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/32.jpg)
Special Harvesting Techniques
• Social Media
  • Focus on shared resources in the social media
Harvesting
Hany M. SalahEldeen and Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, Proceedings of TPDL 2012
http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
![Page 33: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/33.jpg)
Special Harvesting Techniques
• SiteStory - Transactional Archive
Harvesting
Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, and Herbert Van de Sompel, Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool, Proceedings of TPDL 2013
SiteStory: http://mementoweb.github.io/SiteStory/
![Page 34: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/34.jpg)
Harvesting
• Challenges
  • Ajax and Web 2.0/3.0
  • Streaming Media
  • URI challenges
  • Mobile
Harvesting
http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html
http://netpreserve.org/sites/default/files/resources/OverviewFutureWebWorkshop.pdf
![Page 35: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/35.jpg)
Web Archive Life Cycle
Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.
![Page 36: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/36.jpg)
Storage (Format)
• Flat files:
  • WARC files (ISO standard)
• NoSQL DB:
  • HBase at Internet Memory*
• Storage at SUL:
  • We need to use both
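For readers unfamiliar with the flat-file option, the sketch below assembles a single WARC-style record in memory: a version line, named header fields, a blank line, the content block, and a record terminator. It is a simplified illustration and not a complete implementation of the standard; `build_warc_record` is a hypothetical helper, and production harvesting would rely on an established WARC library.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload, record_type="resource", content_type="text/plain"):
    """Serialize one WARC-style record (header fields + content block) as bytes."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; a blank line separates headers from the
    # block, and two CRLFs close the record.
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + payload + b"\r\n\r\n"
```

Records like this are simply appended one after another in a file, which is what makes the format easy to store and move but slow to query without a separate index.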
Storage
*Philippe Rigaux, Understanding HBase - The data model, IM technology blog http://internetmemory.org/en/index.php/synapse/understanding_the_hbase_data_model/
![Page 37: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/37.jpg)
Storage (Infrastructure)
• The wrong solution could be a disaster
Storage
![Page 38: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/38.jpg)
Web Archive Life Cycle
Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.
![Page 39: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/39.jpg)
Accessing Web Archive
URI-Based
Wayback Machine
• Textbox to enter the requested URI
• BubbleMap to show you the available mementos
![Page 40: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/40.jpg)
Accessing Web Archive
Full-text search
• Challenges: Temporal Page Rank, Rank per site or memento, Date filtering
![Page 41: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/41.jpg)
Accessing Web Archive
• Thumbnail View
• Trade-off between building the thumbnails in real time and pre-building them. There is also a trade-off between representing a thumbnail by URI and by embedded binary data. Can we build a partial thumbnail map?
![Page 42: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/42.jpg)
Accessing Web Archive
• Title View
• Trade-off between extracting all the titles in advance and keeping them as metadata about the mementos, and extracting each title from the HTML content in real time
Implemented using Simile: http://www.simile-widgets.org/timeline/
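The title trade-off can be illustrated with a small sketch: titles are parsed from HTML on demand and then kept as metadata so each memento is parsed only once. `fetch_html` and `cache` are hypothetical stand-ins for the archive's content store and its metadata store; this is not how the Simile-based view was implemented.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Real-time path: parse the HTML and return its whitespace-normalized title."""
    parser = TitleParser()
    parser.feed(html)
    return " ".join("".join(parser.parts).split())


def title_for(memento_uri, fetch_html, cache):
    """Metadata path: reuse a stored title, extracting it only on first request."""
    if memento_uri not in cache:
        cache[memento_uri] = extract_title(fetch_html(memento_uri))
    return cache[memento_uri]
```

Pre-extracting every title amounts to filling `cache` at harvest time; the on-demand variant pays the parsing cost only for mementos that users actually view.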
![Page 43: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/43.jpg)
Accessing Web Archive
• Wayback Machine API
• XML interface for the list of available Mementos
![Page 44: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/44.jpg)
Accessing Web Archive
• Web Page Snapshot Replay
• URI rewriting, JavaScript, and embedded resources
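URI rewriting for replay can be sketched as follows, assuming a replay prefix of the form `<prefix><14-digit timestamp>/<original URI>`. The regular expression only covers quoted `href` and `src` attributes, and the function name is hypothetical; links built by JavaScript at run time are not touched, which is one reason replay is hard.

```python
import re
from urllib.parse import urljoin

# Matches quoted href/src attributes; a deliberately narrow assumption.
ATTR = re.compile(r'(?P<attr>\b(?:href|src))=(?P<q>["\'])(?P<url>.*?)(?P=q)', re.IGNORECASE)


def rewrite_links(html, page_uri, timestamp, prefix="http://web.archive.org/web/"):
    """Point every href/src at the archived copy captured around `timestamp`."""

    def repl(match):
        # Resolve relative references against the original page URI first.
        absolute = urljoin(page_uri, match.group("url"))
        quote = match.group("q")
        return f'{match.group("attr")}={quote}{prefix}{timestamp}/{absolute}{quote}'

    return ATTR.sub(repl, html)
```

After rewriting, embedded images and style sheets are requested from the archive rather than from the live web, so the replayed page does not mix past and present content.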
![Page 45: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/45.jpg)
Accessing Web Archive
• Page Completeness Degree
• The completeness degree could be calculated in real time by using the preserved HTTP status of the embedded resources
See also: http://arxiv.org/abs/1309.5503
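One simple way to turn preserved HTTP statuses into a completeness degree is the unweighted ratio below. Treating every embedded resource as equally important is an assumption made for this sketch and is not necessarily the measure used in the referenced work; the function name is hypothetical.

```python
def completeness_degree(embedded_statuses):
    """Fraction of embedded resources whose archived HTTP status is 2xx.

    `embedded_statuses` maps each embedded resource URI to the status code
    recorded when it was captured. A page with no embedded resources is
    treated as complete.
    """
    if not embedded_statuses:
        return 1.0
    ok = sum(1 for status in embedded_statuses.values() if 200 <= status < 300)
    return ok / len(embedded_statuses)
```

Because it needs only the stored status codes, this figure can be computed at request time without rendering the page.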
![Page 46: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/46.jpg)
Accessing Web Archive
• Reconstructing a web site
• The current approach uses the web archive's public interface.
![Page 47: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/47.jpg)
Accessing Web Archive
• Wayback Annotator
  • Create collections
  • Select and save relevant content to their collections
  • Annotate & mark important parts of archived web pages
  • Share their work and collaborate on archived content use
http://netpreserve.org/sites/default/files/resources/Predstavitev_07.pdf
http://netpreserve.org/sites/default/files/resources/Wayback_annotator_06.pdf
![Page 48: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/48.jpg)
Accessing Web Archive
Collection-Based
• In addition to browsing the collection, you can browse the URIs in this collection
• Research questions: Collection overview
![Page 49: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/49.jpg)
Accessing Web Archive
• Collection visualization
• Term frequency algorithms should be normalized to take the density of mementos into consideration
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
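The normalization point can be shown with a short sketch: raw term counts grow simply because a period has more captures, so the count in each period is divided by the number of mementos in that period. Yearly binning, whitespace tokenization, and the function name are assumptions of this example.

```python
from collections import Counter, defaultdict


def normalized_term_frequency(mementos, term):
    """Return {year: occurrences of `term` per memento captured in that year}.

    `mementos` is an iterable of (year, text) pairs.
    """
    counts = defaultdict(int)
    density = Counter()
    for year, text in mementos:
        density[year] += 1
        counts[year] += text.lower().split().count(term.lower())
    return {year: counts[year] / density[year] for year in sorted(density)}
```

Without the division, a heavily crawled year would appear to mention the term more often even if each page mentioned it less.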
![Page 50: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/50.jpg)
Accessing Web Archive
• Web Archive analytics
See also: http://ilpubs.stanford.edu:8090/1037/1/arcspread.pdf
• ArcSpread took a query from the user, extracted related information, and displayed the results in spreadsheet style.
![Page 51: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/51.jpg)
Who And What Links To The Internet Archive
Y. Alnoamany, A. AlSum, M. C. Weigle, M. L. Nelson
In Proceedings of 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013
(Best Student Paper)
See also: http://arxiv.org/abs/1309.4016
![Page 52: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/52.jpg)
Serving Robots!
• Log file analysis using Apache Pig
• Access to IA Wayback Machine as
Robots outnumber Humans
• 10:1 in terms of sessions
• 5:4 in terms of raw HTTP accesses
• 4:1 in terms of megabytes transferred
Access
![Page 53: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/53.jpg)
Where do Wayback Machine Users Come From?
| Website | Percentage | Description |
| --- | --- | --- |
| en.wikipedia.org | 12.9% | Wikipedia |
| archive.org | 11.9% | IA Home Page |
| reddit.com | 10.2% | Social News Web Site |
| google.TLD | 9.9% | Search Engine |
| info-poland.buffalo.edu | 1.5% | Polish Studies |
| de.wikipedia.org | 1.4% | Wikipedia |
| cracked.com | 1.2% | Humor Site |
| snopes.com | 1.1% | Urban Legends Reference Pages |
| facebook.com | 0.9% | Social Media |
| crochetpatterncentral.com | 0.9% | Crocheting Hobbies |
Access
![Page 54: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/54.jpg)
Most Languages Self-Link
Access
![Page 55: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/55.jpg)
ArcLink: Optimization Techniques To Build And Retrieve The Temporal Web Graph
A. AlSum, M. L. Nelson
IIPC GA 2013, Ljubljana, Slovenia
In Proceedings of the 13th international ACM/IEEE joint conference on Digital libraries, JCDL '13, 2013
See also: http://arxiv.org/abs/1305.5959
![Page 56: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/56.jpg)
Easy Solved Questions
Q: What are the available mementos for vancouver2010.com?
Access
![Page 57: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/57.jpg)
Solved Questions, but hard
Q: What are the HTML titles for vancouver2010.com through time?
A: Page scraping for all mementos
Access
![Page 58: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/58.jpg)
Impossible Questions
Q: What are the anchor texts that pointed to www.vancouver2010.com through time?
Access
…<a href=www.vancouver2010.com >Vancouver Olympics</a>….
…<a href=www.vancouver2010.com >Winter Olympics</a>…
…<a href=www.vancouver2010.com >Vancouver 2010</a>…
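The per-page step behind this question can be sketched as below: for each archived page, collect the text of anchors whose `href` mentions the target, then order the results by capture time. The class and function names are hypothetical. The hard part is not this extraction but knowing which of the archive's pages link to the target at all, which a URI-keyed interface cannot answer without scanning everything; a prebuilt link index would be needed.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect anchor texts of <a> elements whose href contains `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.current = None  # text pieces of the anchor being read, if any
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.target in href:
                self.current = []

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.anchors.append("".join(self.current).strip())
            self.current = None


def anchor_text_through_time(mementos, target):
    """`mementos` is a list of (capture_timestamp, html) pairs of linking pages."""
    result = []
    for captured, html in sorted(mementos):
        collector = AnchorCollector(target)
        collector.feed(html)
        collector.close()
        result.extend((captured, text) for text in collector.anchors)
    return result
```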
![Page 60: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/60.jpg)
Impossible Questions
• Q: What are the anchor texts that pointed to www.vancouver2010.com through time?
Access
![Page 61: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/61.jpg)
Thumbnail Summarization Techniques For Web Archives
A. AlSum, and M. L. Nelson
Submitted for publication.
![Page 62: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/62.jpg)
Thumbnails
Access
Internet Archive UK Web archive
![Page 63: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/63.jpg)
Thumbnail Creation Challenges
• Scalability in Time
  • IA may need 361 years to create one thumbnail for each memento using one hundred machines
• Scalability in Space
  • IA will need 355 TB to store one thumbnail for each memento
• Page quality
Access
![Page 66: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/66.jpg)
40 Thumbnails are good.
Access
![Page 67: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/67.jpg)
Same technique applied to apple.com
Access
![Page 68: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/68.jpg)
From 8000 Mementos to 69 Thumbnails.
Access
![Page 69: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/69.jpg)
iTunes cover application
Access
![Page 70: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/70.jpg)
Community
• I suggest becoming a member of the IIPC
• Join the open Wayback Machine team
• Join the Winter Olympics 2014 collaborative project, even as an observer
Congratulations
![Page 71: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/71.jpg)
Community
• Web Archiving Workshops
WAC 2011, Ottawa, Canada
WAC 2012, Stanford, CA, USA
WADL 2013, Indianapolis, IN, USA
TempWeb 2013, Rio de Janeiro, Brazil
![Page 72: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/72.jpg)
Tools to SUL Web Archive
• Selection
• Harvest
• Analysis
• Access
![Page 73: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/73.jpg)
Conclusions
• Be Selective: Cover missing parts of the Web
• Be Older: Include WebBase
• Be Smart: Innovative services
• Be Helpful: Researcher Framework/Dataset
• Be Active: Participate in the WA communities
• Make a difference
[email protected]
@aalsum
![Page 74: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/74.jpg)
BACKUP
![Page 75: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/75.jpg)
What is missing?
IA: Internet Archive; CAN: Library and Archives Canada; PO: Portuguese Web Archive; CZ: Archive of the Czech Web
LoC: Library of Congress; BL: British Library; CAT: Web Archive of Catalonia; TW: National Taiwan University
IC: Icelandic Web Archive; UK: UK National Library; CR: Croatian Web Archive; AIT: Archive-It
![Page 76: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/76.jpg)
Thumbnail Features
SimHash DOM tree
Embedded resources Datetime
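As an illustration of the SimHash feature, the sketch below fingerprints page text and keeps a memento for thumbnailing only when its fingerprint differs enough from the last one kept. Whitespace tokens, MD5-derived token hashes, and the sequential threshold rule are assumptions of this example; it is not the clustering technique evaluated in the paper, and the function names are hypothetical.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a SimHash fingerprint: each token votes on every bit position."""
    votes = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i, vote in enumerate(votes) if vote > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def select_for_thumbnails(mementos, threshold):
    """Return capture times of mementos that differ noticeably from the last one kept.

    `mementos` is a list of (capture_time, text) pairs; `threshold` is the
    maximum Hamming distance still considered "the same looking page".
    """
    selected = []
    last = None
    for captured, text in sorted(mementos):
        fingerprint = simhash(text)
        if last is None or hamming(fingerprint, last) > threshold:
            selected.append(captured)
            last = fingerprint
    return selected
```

Similar texts tend to yield fingerprints that differ in few bits, so near-duplicate captures can be skipped without rendering them, which is how a long list of mementos could shrink to a short list of thumbnails.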
![Page 77: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/77.jpg)
Clustering technique
![Page 78: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/78.jpg)
![Page 79: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/79.jpg)
![Page 80: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/80.jpg)
![Page 81: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/81.jpg)
Web Archive
Web Archive
![Page 82: Web archiving challenges and opportunities](https://reader036.vdocuments.site/reader036/viewer/2022062617/54c824374a795919758b45ac/html5/thumbnails/82.jpg)