Filling in the Blanks: Capturing Dynamically Generated Content
DESCRIPTION
JCDL 2012 Doctoral Consortium presentation by Justin F. Brunelle. Covers the problem Web 2.0 creates for preservation and proposes a solution for client-side capture of content.
TRANSCRIPT
![Page 1: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/1.jpg)
1
Filling in the Blanks: Capturing Dynamically Generated Content
Justin F. Brunelle
Old Dominion University
Advisor: Dr. Michael L. Nelson
JCDL ’12 Doctoral Consortium
06/10/2012
![Page 2: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/2.jpg)
2
![Page 3: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/3.jpg)
3
![Page 4: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/4.jpg)
4
Problem!
• Which exists in the archive?
– Probably the unauthenticated version, right?
• What factors created “my” representation?
– Can I archive “my” representation?
• Am I seeing undead resources?
– Mix of live and archived content?
• How can we capture, share, and archive user experiences?
![Page 5: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/5.jpg)
5
Which version is in the Internet Archive?
![Page 6: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/6.jpg)
6
Which version is in WebCite?
![Page 7: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/7.jpg)
7
Craigslist.org
$ curl -I -L http://www.craigslist.org
HTTP/1.1 302 Found
Set-Cookie: …
Location: http://geo.craigslist.org/

HTTP/1.1 302 Found
Content-Type: text/html; charset=iso-8859-1
Connection: close
Location: http://norfolk.craigslist.org
Date: Thu, 31 May 2012 23:26:27 GMT
Set-Cookie: …
Server: Apache

HTTP/1.1 200 OK
Connection: close
Cache-Control: max-age=3600, public
Last-Modified: Thu, 31 May 2012 23:13:46 GMT
Set-Cookie: …
Transfer-Encoding: chunked
Date: Thu, 31 May 2012 23:13:46 GMT
Vary: Accept-Encoding
Content-Type: text/html; charset=iso-8859-1;
X-Frame-Options: Allow-From https://forums.craigslist.org
Server: Apache
Expires: Fri, 01 Jun 2012 00:13:46 GMT
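The redirect chain in the curl output is what ties the final representation to the requesting client's location. As a minimal Python sketch of that chain, the function below follows `Location` headers hop by hop. It assumes a hypothetical `fetch_headers` callable (not part of the talk) that returns a status code and a header dictionary for a URI; here it is stubbed with the responses shown on the slide instead of touching the network, and relative `Location` values are not handled.

```python
from typing import Callable, Dict, List, Tuple

# (status code, headers) for one HEAD request; this shape is an assumption.
Response = Tuple[int, Dict[str, str]]


def follow_redirects(uri: str,
                     fetch_headers: Callable[[str], Response],
                     max_hops: int = 10) -> List[Tuple[str, int]]:
    """Return the (uri, status) hops, similar to what `curl -I -L` prints."""
    hops: List[Tuple[str, int]] = []
    for _ in range(max_hops):
        status, headers = fetch_headers(uri)
        hops.append((uri, status))
        # Only 3xx responses that carry a Location header continue the chain.
        if 300 <= status < 400 and "Location" in headers:
            uri = headers["Location"]
        else:
            break
    return hops


# Stub standing in for the network, built from the headers on the slide.
# A client elsewhere would presumably receive a different second Location.
NORFOLK_CLIENT: Dict[str, Response] = {
    "http://www.craigslist.org": (302, {"Location": "http://geo.craigslist.org/"}),
    "http://geo.craigslist.org/": (302, {"Location": "http://norfolk.craigslist.org"}),
    "http://norfolk.craigslist.org": (200, {"Cache-Control": "max-age=3600, public"}),
}

chain = follow_redirects("http://www.craigslist.org", NORFOLK_CLIENT.__getitem__)
for hop_uri, hop_status in chain:
    print(hop_status, hop_uri)
```

An archive crawling from a different network location would walk the same first hop but could end at a different city's page, which is one reason the archived copy may not match what the user saw.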
![Page 8: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/8.jpg)
8
Live Resource Accessed from Norfolk
![Page 9: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/9.jpg)
9
Archived Resource Submitted from Norfolk
• Submitted to WebCite from Norfolk
![Page 10: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/10.jpg)
10
Live Norfolk Interactive Mapper
http://gisapp2.norfolk.gov/interactive_mapper/viewer.htm
![Page 11: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/11.jpg)
11
Archived Norfolk Interactive Mapper
http://web.archive.org/web/20100924020604/http://gisapp2.norfolk.gov/interactive_mapper/viewer.htm
![Page 12: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/12.jpg)
12
Web 2.0
• Crawlers aren’t enough
• Relies on interaction/personalization
• Users may want to archive personal content
• How do we capture user experiences?
– Justin’s vs. Dr. Nelson’s experience? Both?
• What about sharing browsing sessions?
![Page 13: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/13.jpg)
13
Things are better (but really worse)
• Better UI, worse archiving
• HTML5
• JavaScript
– document.write
• Cookies
• User Interaction
• GeoIP
![Page 14: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/14.jpg)
14
Traditional Representation Generation
[Diagram, from W3C Web Architecture: a URI identifies a Resource; dereferencing the URI returns a Representation, which represents the Resource.]
![Page 15: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/15.jpg)
15
Representation through content negotiation
[Diagram, from W3C Web Architecture: a URI identifies a Resource; dereferencing the URI and negotiating returns a Representation, which represents the Resource.]
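With content negotiation, one URI can yield different representations depending on the request headers. The following is a simplified Python sketch of server-side selection by `Accept` header; the function names are hypothetical and the parsing covers only media types and q-values, not the full HTTP rules.

```python
from typing import Dict, List, Optional


def parse_accept(header: str) -> Dict[str, float]:
    """Parse an Accept header into {media_type: q-value} (simplified)."""
    prefs: Dict[str, float] = {}
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # q defaults to 1.0 when the client gives no q parameter
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs[media_type] = q
    return prefs


def negotiate(accept: str, available: List[str]) -> Optional[str]:
    """Pick the available media type the client prefers most, or None."""
    prefs = parse_accept(accept)
    best: Optional[str] = None
    best_q = 0.0
    for media_type in available:
        q = prefs.get(media_type)
        if q is None:
            # Fall back to a "type/*" range, then to "*/*".
            type_range = media_type.split("/")[0] + "/*"
            q = prefs.get(type_range, prefs.get("*/*", 0.0))
        if q > best_q:
            best, best_q = media_type, q
    return best


browser = "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"
print(negotiate(browser, ["application/json", "text/html"]))
print(negotiate("application/json", ["application/json", "text/html"]))
```

The two calls request the same hypothetical resource but select different representations, so an archive that records only one of them has captured only part of what the URI can return.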
![Page 16: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/16.jpg)
16
Web 2.0 Representation Generation
[Diagram: a URI identifies a Resource; dereferencing the URI returns a Representation, which represents the Resource and is further shaped by client-side script and user interaction.]
![Page 17: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/17.jpg)
17
Prior Work
• Capture for Debugging and Security
– Mickens, 2010; Livshits, 2007, 2009, 2010; Dhawan, 2009
• Crawlers
– Mesbah, 2008; Duda, 2008; Lowet, 2009
• Caching dynamic content
– Benson, 2010; Karri, 2009; Boulos, 2010; Periyapatna, 2009; Sivasubramanian, 2007
• Walden’s paths
– http://www.csdl.tamu.edu/walden/
• IIPC Workshop 2012: Archiving the Future Web
– http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html
![Page 18: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/18.jpg)
18
Two Current Solutions
• Browser-based crawling
– Problematic at scale, misses post-render content, no session spanning, misses “personal” browsing
– IA
– To be released – Heritrix 3.X
• Transactional Web Archiving
– Impact/depth is unknown, client-side changes are missed, must have server/content author buy-in
– LANL
– http://theresourcedepot.org/
![Page 19: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/19.jpg)
19
What can Justin do about it?
• How can we capture THE user experience?
– How much user-shared content is archivable?
– What defines a dynamic representation?
• Infinitely Changing?
– How much dynamic content are archives missing?
– What tools are required to capture the representation?
• Browser Add-on?
– How much will users contribute to the archives?
• Is this even possible?
![Page 20: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/20.jpg)
20
Characteristics of a Potential Solution
• Browser Add-on
• Crowd sourced
– User contributions to archives
• Opt-in representation archiving/sharing
• Capture client-side DOM
– JS, HTML, representation, etc.
• Capture client-side events and resulting DOM
– Includes Ajax and post-render data
• Package and submit to archives
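The characteristics listed above (opt-in, capture of events and the resulting DOM, packaging for submission) can be illustrated with a small data model. This is a sketch only, written in Python rather than as an actual browser add-on; the class names, event labels, and JSON layout are assumptions made for illustration and do not describe the tool proposed in the talk.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class CapturedEvent:
    kind: str    # hypothetical labels, e.g. "user:dblclick" or "ajax:response"
    detail: str  # free-text description of the event
    dom: str     # serialized DOM as it stood after the event


@dataclass
class SessionCapture:
    uri: str
    opted_in: bool = False  # nothing is recorded unless the user opts in
    events: List[CapturedEvent] = field(default_factory=list)

    def record(self, kind: str, detail: str, dom: str) -> None:
        """Store an event and the DOM that resulted from it."""
        if self.opted_in:
            self.events.append(CapturedEvent(kind, detail, dom))

    def package(self) -> str:
        """Serialize the session as JSON, with a digest per DOM snapshot."""
        payload = {
            "uri": self.uri,
            "events": [
                {**asdict(e),
                 "dom_sha256": hashlib.sha256(e.dom.encode("utf-8")).hexdigest()}
                for e in self.events
            ],
        }
        return json.dumps(payload, sort_keys=True)


capture = SessionCapture("http://example.org/map", opted_in=True)
capture.record("user:dblclick", "zoom in", "<div id='map' data-zoom='2'></div>")
capture.record("ajax:response", "tiles loaded", "<div id='map' data-zoom='2'><img></div>")
package = capture.package()
print(len(package), "bytes ready to submit")
```

Storing the DOM after each event, rather than only the initial HTML, is what lets Ajax and post-render content end up in the package.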
![Page 21: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/21.jpg)
21
![Page 22: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/22.jpg)
Dissertation Plan
[Timeline diagram running from BEGIN to PhD Defense, with Quals and Current State marked along the way. Items shown:]
• Coursework
• Background Research
• Define test datasets (set of dynamic and static test pages)
• Prevalence of Unarchivable Resources
• Define factors/equations of dynamic representations – What dynamic content can (and cannot) be captured for archiving?
• Construction of software solution -- VCR for the Web: Record, Rewind, Replay
• Analysis of improved capture -- Client-side Archiving: Client-side (human assisted) Capture vs. Traditional Crawlers vs. Headless clients
• Explore how personalized archives can be combined with public web archives
![Page 23: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/23.jpg)
23
Current Work: How much can we archive?
• Sample from Bit.ly URIs from Twitter
• Load page in each environment:
– Live
– 3rd Party Archived
• Submit and load from WebCitation
– Locally stored
• wget -k -p and load from local drive
– Local only
• Load from local drive – no Internet access
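The locally stored copy depends on which embedded resources can be discovered from the page's HTML. The Python sketch below shows a simplified form of that discovery (it is not wget's actual algorithm): it collects image, script, and stylesheet URIs from static markup. The class name and the sample page are hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class RequisiteParser(HTMLParser):
    """Collect URIs of images, scripts, and stylesheets named in the markup."""

    def __init__(self, base_uri: str) -> None:
        super().__init__()
        self.base_uri = base_uri
        self.requisites: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in ("img", "script") and attr.get("src"):
            self.requisites.append(urljoin(self.base_uri, attr["src"]))
        elif tag == "link" and attr.get("href"):
            rel = (attr.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.requisites.append(urljoin(self.base_uri, attr["href"]))


PAGE = """<html><head>
<link rel="stylesheet" href="/css/site.css">
<script src="js/app.js"></script>
</head><body>
<img src="logo.png">
<script>document.write('<img src="late.png">');</script>
</body></html>"""

parser = RequisiteParser("http://example.org/2009/page/")
parser.feed(PAGE)
for uri in parser.requisites:
    print(uri)
```

The image written by `document.write` never appears in the list, because it exists only after the script runs in a browser. Resources of that kind are the ones a static download is likely to miss.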
![Page 24: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/24.jpg)
24
Live
http://dctheatrescene.com/2009/06/11/toxic-avengers-john-rando/
![Page 25: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/25.jpg)
25
Archived (WebCite)
http://www.webcitation.org/685EYfYEK
![Page 26: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/26.jpg)
26
Locally Stored
http://localhost/dctheatrescene.com/2009/06/11/toxic-avengers-john-rando/
![Page 27: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/27.jpg)
27
Local Only (No Internet)
http://localhost/dctheatrescene.com/2009/06/11/toxic-avengers-john-rando/
• Missing: 12/78 without Internet
• dctheatrescene.com/…/uperfish.args.js?e83a2c
• dctheatrescene.com/…/css/datatables.css?ver=1.9.3
• Small files, big impact
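The 12/78 figure is a comparison between the resources the page requests and those that still load without a network connection. A minimal Python sketch of that comparison follows; the URIs are placeholders generated for illustration, not the resources from the slide.

```python
from typing import Iterable, List, Tuple


def missing_resources(requested: Iterable[str],
                      loaded: Iterable[str]) -> Tuple[List[str], float]:
    """Return the requested URIs that failed to load and their share of the total."""
    requested_set = set(requested)
    missing = sorted(requested_set - set(loaded))
    fraction = len(missing) / len(requested_set) if requested_set else 0.0
    return missing, fraction


# Placeholder data sized to match the slide: 78 requests, 66 served locally.
live_requests = [f"http://example.org/r{i}" for i in range(78)]
served_offline = live_requests[:66]

missing, fraction = missing_resources(live_requests, served_offline)
print(f"{len(missing)}/{len(live_requests)} missing ({fraction:.1%})")
```

A count like this says nothing about how much each missing file matters, which is the slide's point: a small script or stylesheet can change the rendered page substantially.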
![Page 28: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/28.jpg)
28
Thought Experiment
![Page 29: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/29.jpg)
29
Double Click 4x
![Page 30: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/30.jpg)
30
Click and drag to left
![Page 31: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/31.jpg)
31
Submit to Archive
![Page 32: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/32.jpg)
32
Future Research Questions
• What dynamism can (and cannot) be captured for archiving?
• Client-side Archiving: Client-side Capture vs. Traditional Crawlers
• Client-side contributions to Web Archives: Archiving User Experiences
![Page 33: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/33.jpg)
33
Conclusion
• Is dynamic content archivable?
• How much are we missing?
• Can you archive your experience?
• For the betterment of archives
• For personal capture
![Page 34: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/34.jpg)
34
Backups
![Page 35: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/35.jpg)
35
References
• J. Mickens, J. Elson, and J. Howell. Mugshot: deterministic capture and replay for JavaScript applications. In Proceedings of the 7th USENIX conference on Networked systems design and implementation, NSDI'10, pages 11-11, Berkeley, CA, USA, 2010. USENIX Association.
• K. Vikram, A. Prateek, and B. Livshits. Ripley: Automatically securing web 2.0 applications through replicated execution. In Proceedings of the Conference on Computer and Communications Security, November 2009.
• E. Kiciman and B. Livshits. Ajaxscope: A platform for remotely monitoring the client-side behavior of web 2.0 applications. In the 21st ACM Symposium on Operating Systems Principles (SOSP '07), 2007.
• B. Livshits and S. Guarnieri. Gulfstream: Incremental static analysis for streaming JavaScript applications. Technical Report MSR-TR-2010-4, Microsoft, January 2010.
• M. Dhawan and V. Ganapathy. Analyzing information flow in JavaScript-based browser extensions. Annual Computer Security Applications Conference, pages 382 - 391, 2009.
• A. Mesbah, E. Bozdag, and A. van Deursen. Crawling Ajax by inferring user interface state changes. In Web Engineering, 2008. ICWE '08. Eighth International Conference on, pages 122-134, July 2008.
• C. Duda, G. Frey, D. Kossmann, and C. Zhou. AjaxSearch: crawling, indexing and searching Web 2.0 applications. Proc. VLDB Endow., 1:1440-1443, August 2008.
• D. Lowet and D. Goergen. Co-browsing dynamic web pages. In WWW, pages 941-950, 2009.
![Page 36: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/36.jpg)
36
References
• S. Chakrabarti, S. Srivastava, M. Subramanyam, and M. Tiwari. Memex: A browsing assistant for collaborative archiving and mining of surf trails. In Proceedings of the 26th VLDB Conference, 2000.
• R. Karri. Client-side page element web-caching, 2009.
• E. Benson, A. Marcus, D. R. Karger, and S. Madden. Sync kit: a persistent client-side database caching toolkit for data intensive websites. In WWW, pages 121-130, 2010.
• M. N. K. Boulos, J. Gong, P. Yue, and J. Y. Warren. Web GIS in practice VIII: HTML5 and the canvas element for interactive online mapping. International Journal of Health Geographics, March 2010.
• S. Periyapatna. Total recall for Ajax applications Firefox extension, 2009.
• S. Sivasubramanian, G. Pierre, M. van Steen, and G. Alonso. Analysis of caching and replication strategies for web applications. IEEE Internet Computing, 11:60-66, 2007.
![Page 37: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/37.jpg)
37
Web Archives
• “Web archiving is the process of collecting portions of the World Wide Web and ensuring the collection is preserved … for future researchers, historians, and the public.” -- http://en.wikipedia.org/wiki/Web_archiving
![Page 38: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/38.jpg)
38
What does this have to do with DLs?
• Improved coverage
• NARA regulation
• Improved “memory”
• Gathers missing User Experiences
– Or at least an adequate sub-sample
![Page 39: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/39.jpg)
39
Envisioned Solution
User Event: Text Entered
User Event: Double Click
User Event: Text Entered
Button Push
Ajax Event: XMLResponse
Ajax Event: XMLResponse
Ajax Event: XMLResponse
SELECT PREVIOUS REPRESENTATION TO ARCHIVE:
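The envisioned interface lets the user pick an earlier point in the recorded session and archive the representation as it stood then. The Python sketch below is one way to model that selection, assuming each recorded step stores a label and the DOM that resulted from it. The step labels echo the slide; the DOM strings and function names are invented for illustration.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    label: str  # e.g. "User Event: Double Click"
    dom: str    # serialized DOM after this step


def menu(timeline: List[Step]) -> List[str]:
    """Build the list of choices shown to the user."""
    return [f"{i}: {step.label}" for i, step in enumerate(timeline)]


def select_representation(timeline: List[Step], index: int) -> str:
    """Return the representation as it stood after the chosen step."""
    if not 0 <= index < len(timeline):
        raise IndexError(f"no step {index} in a timeline of {len(timeline)}")
    return timeline[index].dom


timeline = [
    Step("User Event: Text Entered", "<input value='norfolk'>"),
    Step("Button Push", "<input value='norfolk'><p>Searching</p>"),
    Step("Ajax Event: XMLResponse", "<input value='norfolk'><ul><li>Result</li></ul>"),
]

print("\n".join(menu(timeline)))
print(select_representation(timeline, 1))
```

Keeping a full snapshot per step makes rewinding a simple lookup at the cost of storage; replaying events from the first snapshot would be the alternative.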
![Page 40: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/40.jpg)
40
Google Maps
![Page 41: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/41.jpg)
41
Current Web Applications
![Page 42: Filling in the Blanks: Capturing Dynamically Generated Content](https://reader033.vdocuments.site/reader033/viewer/2022061300/54d1134e4a7959964d8b46fa/html5/thumbnails/42.jpg)
42
Web Applications with Session Archiver