scripts in a frame: a two-tiered crawling approach to archiving deferred representations

Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations Justin F. Brunelle Dissertation Defense February 5, 2016 Committee Members: Michael L. Nelson Michele C. Weigle Elizabeth J. Vincelette Irwin B. Levinstein

Upload: justin-brunelle

Post on 08-Apr-2017


TRANSCRIPT

Page 1: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations

Justin F. Brunelle

Dissertation Defense

February 5, 2016

Committee Members:

Michael L. Nelson

Michele C. Weigle

Elizabeth J. Vincelette

Irwin B. Levinstein

Page 2: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

A simpler time…

2

Page 3: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Mass hysteria. Human sacrifices. Dogs and cats living together.

3

<iframe><script>…</script></iframe>

Page 4: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

4


Page 5: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html

Missing resources (bad)

2008

Page 6: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html

2008 2012

Missing resources (bad) and Temporal violations (worse)

Page 7: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Old ads are interesting

7

Page 8: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

New ones are annoying…for now.

8

“Why are your parents wrestling?”

Page 9: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Today’s ads are missing from the archives

9

http://adserver.adtechus.com/addyn/3.0/5399.1/2394397/0/-1/QUANTCAST;;size=300x250;target=_blank;alias=p36-17b4f9us2qmzc8bn;kvp36=p36-17b4f9us2qmzc8bn;sub1=p-4UZr_j7rCm_Aj;kvl=172802;kvc=794676;kvs=300x250;kvi=c052a803d0b5476f0bd2f2043ef237e27cd48019;kva=p-4UZr_j7rCm_Aj;rdclick=http://exch.quantserve.com/r?a=p-4UZr_j7rCm_Aj;labels=_qc.clk,_click.adserver.rtb,_click.rand.85854;rtbip=192.184.64.144;rtbdata2=EAQaFUhSQmxvY2tfMjAxNlRheFNlYXNvbiCZiRcogsYKMLTAMDoSaHR0cDovL3d3dy5jbm4uY29tWihUUEhwYlUzM3ZqeFU5LTA1SGZEMk1SXzE0anBVcGU0d0dxTG10STFUdUs2IECAAb_JicoFoAEBqAGhy7YCugEoVFBIcGJVMzN2anhVOS0wNUhmRDJNUl8xNGpwVXBlNHdHcUxtdEkxVMAB3ed3yAGUp7GUqSraAShjMDUyYTgwM2QwYjU0NzZmMGJkMmYyMDQzZWYyMzdlMjdjZDQ4MDE55QHvEWs-6AFkmAK2wQqoAgWoAgawAgi6AgTAuECQwAICyAIA0ALe9baMj4Cos-oB

Page 10: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

JavaScript is hard to replay

What happens when things are completely lost?

http://ws-dl.blogspot.com/2013/11/2013-11-28-replaying-sopa-protest.html

10

Page 11: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Remember SOPA? And the protest?

https://en.wikipedia.org/wiki/Stop_Online_Piracy_Act

https://en.wikipedia.org/wiki/Protests_against_SOPA_and_PIPA

Page 12: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

http://en.wikipedia.org/wiki/Main_Page, January 18th, 2012

Page 13: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

http://web.archive.org/web/20120118110520/http://en.wikipedia.org/wiki/Main_Page, January 18th, 2012

Page 14: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

14

Problem!

The archives contain the Web as seen by crawlers

Page 15: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Why archive?

The Internet Archive has everything!

Why didn’t you back it up?

Participating institutions can hand over their databases.

15

Page 16: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Crimean Conflict

Russian troops captured the Crimean Center for Investigative Journalism

Gunman: “We will try to agree on the correct truthful coverage of events.”

16

http://gijn.org/2014/03/02/masked-gunmen-seize-crimean-investigative-journalism-center/

Page 17: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Archive-It to the rescue!

17

Page 18: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

How?

Masked gunmen have your servers

Where are your backups?

Transactional archive? Too late!

18

Preservation over HTTP

Page 19: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

How?

Masked gunmen have your servers

Where are your backups?

Transactional archive? Too late!

19

Preservation over HTTP

Page 20: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Any future discussion of the 21st century will involve the web and the web archives

20

Page 21: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Any future discussion of the 21st century will involve the web and the web archives

But JavaScript is hard to archive, resulting in archives of content as seen by crawlers rather than as seen by users

21

Page 22: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Any future discussion of the 21st century will involve the web and the web archives

But JavaScript is hard to archive, resulting in archives of content as seen by crawlers rather than as seen by users

22

Goal: Mitigate the impact of JavaScript on the archives by making crawlers behave like users

Page 23: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

23

Motivating Examples

Background Information

Research Questions

Measuring the Impact of JavaScript

Measuring Memento Quality

Crawling Deferred Representations

Future Work

Conclusions

Page 24: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Some Institutional Archives

24

Page 25: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Some Page-at-a-time Archivers

25

Page 26: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Some Archival Tools

1: http://warcreate.com/

2: http://matkelly.com/wail/

Page 27: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Memento Framework

http://mementoweb.org/guide/rfc/

Machine readable bidirectional link between the past and present web

Page 28: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

28

Page 29: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

29

Page 30: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

30

URI-R: Original Resource identifier (page on the live web)

URI-M: Memento identifier (archived version of a page)

URI-T: TimeMap identifier (list of archived pages)

Page 31: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Web Architecture

31

Dereference a URI, get a representation

Page 32: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

JavaScript makes requests for new resources after the initial page load

32

http://maps.google.com

Identifies

Represents

Page 33: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Deferred Representation

33

http://maps.google.com

Identifies

Represents

Page 34: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

JavaScript != Deferred

34

Deferred

HTTP GET, HTTP GET, HTTP GET, HTTP GET

onload

Nondeferred

HTTP GET
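
The distinction on this slide can be sketched in a few lines: a load counts as deferred when some HTTP GET is issued after the onload event, and as nondeferred when all requests precede it. The request-log format and the name `is_deferred` are assumptions made for illustration; this is not the classifier described in the dissertation.

```python
from typing import List, Tuple

# Each entry is (seconds since navigation start, requested URI).
# This log format is a hypothetical stand-in for what a headless
# browser might record while loading a page.
RequestLog = List[Tuple[float, str]]


def is_deferred(requests: RequestLog, onload_time: float) -> bool:
    """Return True if any request was issued after the onload event fired."""
    return any(issued_at > onload_time for issued_at, _ in requests)


# Hypothetical page loads: the second one fetches a map tile after onload.
static_page = [(0.0, "http://example.com/"), (0.2, "http://example.com/logo.png")]
map_page = static_page + [(1.5, "http://example.com/tiles/3/4.png")]

print(is_deferred(static_page, onload_time=1.0))  # False
print(is_deferred(map_page, onload_time=1.0))     # True
```

A page can contain plenty of JavaScript and still be nondeferred under this test, which is the point of "JavaScript != Deferred".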

Page 35: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Web Browsing Process

35

User-controlled

Interaction

Environment variables → content negotiation

Client-controlled representation changes

HTTP GET Request for Resource R

HTTP 200 OK Response: R Content

Browser renders and displays R

JavaScript requests embedded resources

Server returns embedded resources

R updates its representation

Page 36: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Web Browsing Process

36

There is no longer “the” representation.

At any given time, users get “a” representation.

GeoIP: Washington, D.C.; URI-R: http://www.wunderground.com/

GeoIP: Suffolk, VA; URI-R: http://www.wunderground.com/

Page 37: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

The Internet Archive got everything, right?

37

Page 38: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Missing tiles, not interactive

38

Page 39: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

HTTP GET Request for Resource R

HTTP 200 OK Response: R Content

Browser renders and displays R

JavaScript requests embedded resources

Server returns embedded resources

R updates its representation

Web Browsing Process

39

Archival Tools stop here

Page 40: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

HTTP GET Request for Resource R

HTTP 200 OK Response: R Content

Browser renders and displays R

JavaScript requests embedded resources

Server returns embedded resources

R updates its representation

Web Browsing Process

40

Archival Tools stop here

Page 41: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

HTTP GET Request for Resource R

HTTP 200 OK Response: R Content

Browser renders and displays R

JavaScript requests embedded resources

Server returns embedded resources

R updates its representation

Web Browsing Process

41

Archival Tools stop here

Still not solved!

Page 42: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

42

Motivating Examples

Background Information

Research Questions

Measuring the Impact of JavaScript

Measuring Memento Quality

Crawling Deferred Representations

Future Work

Conclusions

Page 43: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Research Questions

RQ1. To what extent does JavaScript impact archival tools?

RQ2. How do we measure memento quality?

RQ3. How can we crawl, archive, and play back deferred representations?

43

Page 44: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

44

Motivating Examples

Background Information

Research Questions

Measuring the Impact of JavaScript

Measuring Memento Quality

Crawling Deferred Representations

Future Work

Conclusions

2015 2013

Page 45: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Zombies!

45

2008

2012

Page 46: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Measuring JavaScript

1,000 URIs from Twitter

1,000 URIs from Archive-It

Dataset available at http://www.cs.odu.edu/~jbrunelle/jsDataSet.txt

Capture with tools

Study the archivability

“The impact of JavaScript on archivability”, 2015, International Journal of Digital Libraries

Page 47: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Good

47

Page 48: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Good

48

Page 49: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Good

49

Page 50: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Meh

50

Page 51: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Meh

51

Page 52: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Bad

52

Page 53: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Bad

53

Page 54: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Bad

54

Page 55: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Bad

55

Page 56: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Bad

56

Page 57: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Leakage by archival tool

Twitter has more leakage than Archive-It

Page 58: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Leakage by archival tool

Wayback reduces leakage the most

Page 59: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Leakage -> Zombies

12% increase in embedded mementos loaded via JavaScript
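
As a hedged illustration of how leakage might be spotted, the sketch below assumes that leakage means a memento requesting a resource from the live web during replay rather than from the archive. The replay log and the helper name `leaked_requests` are hypothetical, and the prefix check is a simplification that only covers Internet Archive style URI-Ms.

```python
from typing import Iterable, List

# Prefix shared by Internet Archive URI-Ms (see the URI-M examples in this deck).
ARCHIVE_PREFIX = "http://web.archive.org/web/"


def leaked_requests(requested_uris: Iterable[str],
                    archive_prefix: str = ARCHIVE_PREFIX) -> List[str]:
    """Return the requests made during replay that bypass the archive."""
    return [uri for uri in requested_uris if not uri.startswith(archive_prefix)]


# Hypothetical list of requests observed while replaying one memento.
replay_log = [
    "http://web.archive.org/web/20120118110520/http://en.wikipedia.org/wiki/Main_Page",
    "http://web.archive.org/web/20120118110520/http://example.com/style.css",
    "http://example.com/ads/banner.js",  # served by the live web: a zombie
]

print(leaked_requests(replay_log))  # ['http://example.com/ads/banner.js']
```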

Page 60: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Leakage increasing over time

Increased JavaScript -> increases in missing embedded resources

Page 61: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

61

• 73.1% of all missing embedded mementos are loaded via JavaScript

• 33% increase in missing embedded mementos from JavaScript between 2005-2012

Leakage increasing over time

Page 62: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

62

Motivating Examples

Background Information

Research Questions

Measuring the Impact of JavaScript

Measuring Memento Quality

Crawling Deferred Representations

Future Work

Conclusions

2015 2014

Page 63: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

“Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014, International Journal of Digital Libraries, 2015

VS.

63

Page 64: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

“Live” XKCD

• Missing 17% of embedded resources

• Looks complete

64

Page 65: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

“Live” XKCD

• Take three resources:

• Logo

• Main Comic

• Navigation Strip

• Relative importance?

• All present in “Live” XKCD

65

Page 66: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Damaging XKCD

• Created a local memento

• Removed the logo and navigation strip

• Now missing 29% of embedded resources

• Human assessment: looks OK

66

Page 67: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Damaging XKCD

• From our local memento

• Removed the Main Comic

• Now missing 24% of embedded resources

• Human assessment: Not a usable memento

67

Page 68: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Damaging XKCD

• From our local memento

• Removed the Main Comic

• Now missing 24% of embedded resources

• Human assessment: Not a usable memento

• Percent of missing embedded resources is not a suitable metric for memento quality

68

Page 69: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Image Importance

• Size (as percentage of all pixels)

69

Page 70: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Image Importance

• Size

• Position (in viewport?)

70

Page 71: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Image Importance

• Size

• Position

• Centrality (in the vertical or horizontal center?)

71

Page 72: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Missing CSS

• More important than thought

• Calculated the amount of content in each vertical third

• If >=80% in left column and missing CSS, CSS is important

• Only performed if stylesheets are missing
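
A minimal sketch of the heuristic above, under stated assumptions: content is summarized as a list of horizontal positions (for example, the x coordinate of each rendered content pixel or element), the page is split into three equal-width columns, and the 80% threshold is taken from the slide. The function names and the input format are hypothetical.

```python
from typing import List, Sequence

LEFT_HEAVY_THRESHOLD = 0.80  # the ">=80%" figure from the slide


def fraction_in_thirds(content_x: Sequence[float], page_width: float) -> List[float]:
    """Share of content falling in the left, middle, and right vertical thirds."""
    counts = [0, 0, 0]
    for x in content_x:
        # Clamp to the last column so x == page_width lands in the right third.
        column = min(int(3 * x / page_width), 2)
        counts[column] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0, 0.0, 0.0]


def css_is_important(content_x: Sequence[float], page_width: float,
                     stylesheets_missing: bool) -> bool:
    """Apply the check only when stylesheets are missing, as the slide states."""
    if not stylesheets_missing:
        return False
    left_share = fraction_in_thirds(content_x, page_width)[0]
    return left_share >= LEFT_HEAVY_THRESHOLD


# Hypothetical unstyled page: four of five content items hug the left edge.
xs = [10, 20, 30, 40, 500]
print(css_is_important(xs, page_width=900, stylesheets_missing=True))   # True
print(css_is_important(xs, page_width=900, stylesheets_missing=False))  # False
```

The intuition is that a page whose stylesheet failed to load tends to collapse into a single left-aligned column, so a left-heavy layout combined with a missing stylesheet suggests the CSS mattered.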

72

Page 73: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Methodology

• Defined Dm and Mm metrics

$M_m = \frac{\text{Missing Embedded Resources}}{\text{All Embedded Resources}}$

$D_m = \frac{\sum_{i=1}^{n_{\text{missing resources}}} w_i}{\sum_{j=1}^{n_{\text{all resources}}} w_j}$

• Used Amazon Mechanical Turkers to assess web user perception of quality

• Assessed Dm versus Mm in manually damaged pages

• Assessed Dm versus Mm in the archives
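
The two metrics can be compared on a toy page. In this sketch the resource names and weights are invented for illustration (they are not the weights computed in the dissertation, and the percentages differ from the XKCD slides); the only things taken from the slide are the two ratios themselves.

```python
from typing import Dict, Set


def percent_missing(weights: Dict[str, float], missing: Set[str]) -> float:
    """Mm: missing embedded resources divided by all embedded resources."""
    return len(missing & weights.keys()) / len(weights)


def damage(weights: Dict[str, float], missing: Set[str]) -> float:
    """Dm: summed weight of missing resources divided by summed weight of all."""
    total = sum(weights.values())
    lost = sum(w for uri, w in weights.items() if uri in missing)
    return lost / total


# Hypothetical importance weights for a comic page with three embedded images.
weights = {"logo.png": 1, "nav.png": 1, "comic.png": 8}

for missing in ({"logo.png", "nav.png"}, {"comic.png"}):
    print(sorted(missing), round(percent_missing(weights, missing), 2),
          round(damage(weights, missing), 2))
# ['logo.png', 'nav.png'] 0.67 0.2
# ['comic.png'] 0.33 0.8
```

Losing the comic removes fewer resources but far more weight, so Dm moves in the direction a human assessor would expect while Mm moves the other way.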

73

Page 74: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Turk Results

74

Live vs Manually Damaged Dm

Mementos from Internet Archive

Agreement with Dm

Mementos from Internet Archive

Agreement with Mm

50/50 Chance

Page 75: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Damage in the Archives

75

Internet Archive WebCite

Mementos with deferred representations have 13.5% higher damage rating

Page 76: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

76

Motivating Examples

Background Information

Research Questions

Measuring the Impact of JavaScript

Measuring Memento Quality

Crawling Deferred Representations

Future Work

Conclusions

2015 2016

Page 77: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

77

Current Workflow

• Dereference URI-Rs

• Archive representation

• Extract embedded URI-Rs

• Repeat
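
The four steps above amount to a frontier loop. The sketch below is a generic illustration, not Heritrix's implementation: `fetch` and `extract_links` are caller-supplied placeholders, and the tiny in-memory "web" exists only so the example runs without network access.

```python
from collections import deque
from typing import Callable, Dict, List


def crawl(seeds: List[str],
          fetch: Callable[[str], str],
          extract_links: Callable[[str], List[str]],
          limit: int = 100) -> Dict[str, str]:
    """Dereference, archive, extract, repeat until the frontier is empty."""
    archive: Dict[str, str] = {}
    frontier = deque(seeds)
    while frontier and len(archive) < limit:
        uri = frontier.popleft()
        if uri in archive:
            continue
        body = fetch(uri)                  # dereference the URI-R
        archive[uri] = body                # archive the representation
        for link in extract_links(body):   # extract embedded URI-Rs
            if link not in archive:
                frontier.append(link)      # repeat
    return archive


# Hypothetical three-page web; each body is a space-separated list of links.
FAKE_WEB = {
    "http://a.example/": "http://b.example/ http://c.example/",
    "http://b.example/": "http://a.example/",
    "http://c.example/": "",
}

captured = crawl(["http://a.example/"],
                 fetch=lambda uri: FAKE_WEB.get(uri, ""),
                 extract_links=lambda body: body.split())
print(sorted(captured))
```

Because links are extracted from the fetched body only, anything a script would request after the page loads never enters the frontier.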

Page 78: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

78

Two-Tiered Crawling

“Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015

“Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016

Page 79: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

79

<script> tags alone are not indicative of a deferred representation. JavaScript can be played back in the archives!

Current workflow not suitable for deferred representations

Use PhantomJS to run JavaScript, interact with the representation

Two-tiered crawling approach to optimize performance

Page 80: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

80

<script> tags alone are not indicative of a deferred representation. JavaScript can be played back in the archives!

Current workflow not suitable for deferred representations

Use PhantomJS to run JavaScript, interact with the representation

Two-tiered crawling approach to optimize performance

More URI-Rs in the crawl frontier

Runs more slowly but more deeply

Page 81: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Comparing Performance

• Crawled 10,000 URI-Rs

Dataset available at http://www.cs.odu.edu/~jbrunelle/wsdl/10kuris.txt

• Compare crawl speed & discovered frontier size

• With and without classifier

• Code available at https://github.com/jbrunelle/classifyDeferred/

81

Page 82: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Performance: Frontier Size

PhantomJS creates a 1.5x larger crawl frontier than Heritrix

Page 83: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Performance: Crawl Speed

83

Heritrix: ~2 URIs/second

PhantomJS: ~4 seconds/URI
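
A back-of-the-envelope estimate shows why the second tier is used selectively. The two rates come from this slide; everything else is an assumption for illustration: a purely serial crawl, every URI-R visited by the fast crawler, and only a (hypothetical) deferred fraction additionally loaded in PhantomJS.

```python
HERITRIX_SECONDS_PER_URI = 1 / 2.0   # ~2 URIs/second, from the slide
PHANTOMJS_SECONDS_PER_URI = 4.0      # ~4 seconds/URI, from the slide


def crawl_seconds(n_uris: int, deferred_fraction: float) -> float:
    """Serial estimate: fast crawler on everything, PhantomJS on the deferred share."""
    if not 0.0 <= deferred_fraction <= 1.0:
        raise ValueError("deferred_fraction must be between 0 and 1")
    fast = n_uris * HERITRIX_SECONDS_PER_URI
    slow = n_uris * deferred_fraction * PHANTOMJS_SECONDS_PER_URI
    return fast + slow


# 10,000 URI-Rs, matching the size of the crawl described in this section.
print(crawl_seconds(10_000, 0.0))    # 5000.0  (fast crawler only)
print(crawl_seconds(10_000, 0.25))   # 15000.0 (illustrative 25% deferred)
print(crawl_seconds(10_000, 1.0))    # 45000.0 (everything through both tiers)
```

Per URI, the headless browser is roughly 8 times slower (4 s versus 0.5 s), so the total cost is driven by how many URI-Rs are routed to it.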

Page 84: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Classifier

We are omitting a discussion about the classifier for deferred vs. nondeferred representations

Please see Section 7.4 in the dissertation for a detailed discussion

84

Page 85: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Descendants = States of deferred representations reached through client-side events

85

Click Pan Zoom

Click Pan Zoom

Page 86: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Crawling descendants

• Interactions represented as N-ary tree G

• FSM: M = (S, s0, Σ, δ)

‒ S is the finite set of client states

‒ s0 ∈ S is the initial state reached by dereferencing the URI-R and executing the initial on-load events

‒ e ∈ Σ defines the client-side event e as a member of the set of all events Σ

‒ δ : S × Σ → S is the transition function in which a client-side event is executed and leads to a new state

si, sj ∈ S

δ(si, e) = sj

e = client-side event

j = i + 1

“Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
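
The state machine above can be explored breadth-first, with each executed event moving one level deeper. The sketch below is an illustration under assumptions: the transition function δ is given as a plain dictionary, the state and event names are made up, and exploration stops at a configurable depth (2 here, matching the depth discussed on the following slides).

```python
from collections import deque
from typing import Dict, List, Tuple

# delta maps (state, event) -> new state; missing keys mean the event
# does not lead to a new state.
Delta = Dict[Tuple[str, str], str]


def descendants_by_level(s0: str, events: List[str], delta: Delta,
                         max_depth: int = 2) -> Dict[str, int]:
    """Return every state reachable within max_depth events, with its level."""
    levels = {s0: 0}
    queue = deque([s0])
    while queue:
        state = queue.popleft()
        if levels[state] == max_depth:
            continue  # do not fire events beyond the depth limit
        for event in events:
            nxt = delta.get((state, event))
            if nxt is not None and nxt not in levels:
                levels[nxt] = levels[state] + 1
                queue.append(nxt)
    return levels


# Hypothetical interaction tree for a single page.
delta = {
    ("s0", "click"): "s1a",
    ("s0", "pan"): "s1b",
    ("s1a", "click"): "s2a",
    ("s2a", "click"): "s3a",  # never visited with max_depth=2
}

print(descendants_by_level("s0", ["click", "pan", "zoom"], delta))
# {'s0': 0, 's1a': 1, 's1b': 1, 's2a': 2}
```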

Page 87: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices

s0

s1

s2

Interaction Trees are 2 Levels Deep

Page 88: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices

s0

s1

s2

Interaction Trees are 2 Levels Deep

Page 89: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

89

Interaction Trees are 2 Levels Deep

http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices

s0

s1

s2

Page 90: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

90

Interaction Trees are 2 Levels Deep

http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices

s0

s1

s2

Page 91: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

91

Interaction Trees are 2 Levels Deep

http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices

s0

s1

s2

Page 92: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Expanding the Crawl Frontier

92

Level s1 provides the greatest benefit to the crawl frontier

Nondeferred

Deferred

Page 93: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Crawling Descendants

93

New embedded resources at level s1 are largely unarchived

Page 94: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Crawling Descendants

94

Level s1 has the highest cost-benefit Return on Investment

Page 95: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Storage Impact of Two-Tiered Crawling

IIPC-proposed JSON metadata of interactions, resulting descendants

–Potentially used to resolve URI-M collisions

–16.5KB WARC metadata

–143MB for total dataset

11.4 times larger for deferred vs nondeferred

Totals 5.12 times more storage per URI-R for total dataset

95

2013

Page 96: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

96

Motivating Examples

Background Information

Research Questions

Measuring the Impact of JavaScript

Measuring Memento Quality

Crawling Deferred Representations

Future Work

Conclusions

Page 97: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Future Work

• Modeling user interactions, tendencies, and simulation

– Form filling

– Click and navigation likelihood

• Evaluating success of crawling deferred representations

– Random walks through the archives

– Dm vs Mm of mementos of deferred representations

• Archival Halting Problem: How much is enough?

– Mapping applications: How many pans and zooms get all the Norfolk, VA Google map tiles?

– How many CNN.com pages get all the Google Ads?

• Playing back WARCs with IIPC metadata of deferred representations and descendants

97

Page 98: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

98

Motivating Examples

Background Information

Research Questions

Measuring the Impact of JavaScript

Measuring Memento Quality

Crawling Deferred Representations

Future Work

Conclusions

Page 99: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

RQ1. To what extent does JavaScript impact archival tools?

Contributions:

• Defined and identified zombie resources

• Adoption of JavaScript correlates with missing embedded resources in mementos

• Defined deferred representations

• Showed that deferred representations have reduced archivability

99

2012: ws-dl.blogspot.com

2013: TPDL2013

2015: iPRES2015

2015: IJDL

2015: IJDL

Section 4.3

Ch. 5

Ch. 2

Ch. 5

For more information, reference:

Page 100: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

RQ2. How do we measure memento quality?

Contributions:

• Mm is not accurate (worse than coin-flip)

• Created Dm metric

• Dm is closer to user perception than Mm

• Mementos of deferred representations have higher Dm than nondeferred representations

100

2014: JCDL2014

2015: IJDL Special Issue

Ch. 6

Section 6.6

For more information, reference:

Page 101: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

RQ3. How can we crawl, archive, and play back deferred representations?

Contributions:

• Defined a framework for archiving deferred representations

• Showed that the framework will crawl more slowly but more thoroughly

• Defined descendants, showed that they are 2-levels deep

• Showed the storage impact of crawling descendants and deferred representations

101

2015: iPRES2015

2016: arXiv:1601.05142

Ch. 7

Ch. 7

For more information, reference:

Page 102: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Summary

• Measured the impact of JavaScript on the archives

• Quantified damage caused by JavaScript

• Measured the cost in time and space to archive JavaScript

Provides policy makers information to make decisions regarding JavaScript handling in crawling and archiving

Quantified an intuitive understanding of crawling deferred representations at web scale

102

Page 103: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Backups

103

Page 104: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

104

Publications

Year | RQ | Venue | Abbreviated Title | Notes

2012 | - | JCDL2012 Doctoral Consortium | Capturing Dynamic Web | -

2013 | - | JCDL2013 | TimeMap Caching | -

2013 | RQ1 | TPDL2013 | Archivability Over Time | -

2013 | - | TPDL2013 | Transactional Archiving | -

2013 | RQ1 | DLib Magazine 19(11/12) | Identifying Mementos | -

2014 | RQ2 | JCDL2014 | Measuring Memento Damage | Best Student Paper

2015 | RQ1 | International Journal of Digital Libraries | Measuring Impact of JavaScript | -

2015 | RQ2 | International Journal of Digital Libraries | Measuring Memento Damage | JCDL2015 Special Issue

2015 | - | JCDL2015 | Merging Mobile and Desktop | Best Poster

2015 | RQ3 | iPRES2015 | Two-Tiered Crawling | -

2016 | RQ3 | Technical Report, arXiv:1601.05142 | Hypercube Model for Archiving | -

2016 | - | DLib Magazine 22(1/2) | Archiving Corporate Intranets | -

Page 105: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Publications

• Justin F. Brunelle, “Filling in the Blanks: Capturing the Dynamic Web”, JCDL 2012 Doctoral Consortium

• Justin F. Brunelle, Michael L. Nelson “An Evaluation of Caching Policies for Memento TimeMaps”, JCDL 2013

• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “On the Change in Archivability of Websites Over Time”, TPDL 2013

• Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, “Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool”, TPDL 2013

• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 19(11/12), 2013.

• Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle, and Michael L. Nelson “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014

• Justin F. Brunelle, Mat Kelly, Michele C. Weigle, and Michael L. Nelson “The impact of JavaScript on archivability”, 2015, IJDL

• Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, and Michael L. Nelson “Mobile Mink: Merging Mobile and Desktop Archived Webs”, JCDL 2015

• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015

• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016

• Justin F. Brunelle, Krista Ferrante, Eliot Wilczek, Michele C. Weigle, and Michael L. Nelson, “Leveraging Heritrix and the Wayback Machine on a corporate intranet: A case study on improving corporate archives”, D-Lib Magazine, 22(1/2), 2016

105

Page 106: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Mobile Mink: Merging Mobile and Desktop Archived Webs

Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, Michael L. Nelson

Acknowledgements

This work supported in part by the NEH HK-50181. This work was performed as part of Wesley Jordan’s mentorship at The MITRE Corporation. The author’s affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions or viewpoints expressed by the author.

More about Mobile Mink

http://bitly.com/MobileMink/

Desktop URIs are much more prevalent than their mobile counterparts in the archives because crawlers use desktop user-agent strings.

Corresponding mobile URIs are archived less frequently even though the representations are different than their desktop counterparts.

http://espn.go.com/ http://m.espn.go.com/

Same ESPN, different URIs, different HTML, different TimeMaps.

Browse to a URI-R

Potential content-negotiation from user-agent

Access tool from the “Share” menu

MobileMink merges TimeMaps of http://espn.go.com & http://m.espn.go.com/

Desktop and mobile webs differ and the linkage between them is lost in the archives

Discovers mobile and desktop URI-Rs

Uses Memento to get all available TimeMaps

Provides integrated TimeMap

Offers users ability to submit mobile and desktop URI-Rs to archives

Increases coverage of mobile URI-Rs in the archives

Page 107: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

HTTP Request

$ curl -i -v http://www.cs.odu.edu/

> GET / HTTP/1.1

> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2

> Host: www.cs.odu.edu

> Accept: */*

>

< HTTP/1.1 200 OK

< Server: nginx

< Date: Tue, 25 Mar 2014 23:42:38 GMT

< Content-Type: text/html

< Transfer-Encoding: chunked

< Connection: keep-alive

<

107

Page 108: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

HTTP Response

HTTP/1.1 200 OK

Server: nginx

Date: Tue, 25 Mar 2014 23:40:09 GMT

Content-Type: text/html

Transfer-Encoding: chunked

Connection: keep-alive

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<!-- saved from url=(0036)http://www.cs.odu.edu/newcssite/new/ -->

<!-- saved from url=(0019)http://sci.odu.edu/ -->

<HTML xmlns:st1 = "urn:schemas-microsoft-com:office:smarttags">

<HEAD>

<meta name="verify-v1" content="CXMn8RoyhZpl9fsKpbgxtiFw3kIdHD51r/ntbf1Rrcw=" >

<TITLE>Department Of Computer Science</TITLE>


Page 109: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Client-side code modifies the DOM

Page 110: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Internet Archive URI-M


http://web.archive.org/web/20140314130018/http://espn.go.com/

Archive Prefix Memento-DateTime URI-R
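
The three labeled parts of the URI-M can be pulled apart with a short Python sketch. It assumes the Internet Archive pattern shown on this slide (prefix, 14-digit timestamp, URI-R); the pattern and the function name `split_uri_m` are illustrative and do not cover other archives' URI-M formats.

```python
import re
from datetime import datetime

# Assumed pattern: archive prefix, 14-digit Memento-Datetime, then the URI-R.
URI_M_PATTERN = re.compile(
    r"^(?P<prefix>https?://web\.archive\.org/web/)"
    r"(?P<timestamp>\d{14})/"
    r"(?P<uri_r>.+)$"
)


def split_uri_m(uri_m):
    """Return (archive_prefix, memento_datetime, uri_r) for an Internet Archive URI-M."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError("not a recognized Internet Archive URI-M: " + uri_m)
    memento_datetime = datetime.strptime(match.group("timestamp"), "%Y%m%d%H%M%S")
    return match.group("prefix"), memento_datetime, match.group("uri_r")


prefix, memento_datetime, uri_r = split_uri_m(
    "http://web.archive.org/web/20140314130018/http://espn.go.com/"
)
```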

Page 111: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Deferred Representations

Representation is incomplete

Client-side code execution completes the build of the representation


Page 112: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Web Browsing Process

Deferred representations

Page 113: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Percent Missing vs. Weighted Damage

• $M_M$ = Percent of embedded resources missing

$$M_M = \frac{\text{Embedded Resources Missing}}{\text{Total Embedded Resources}}$$

• $D_M$ = Damage rating of missing embedded resources

$$D_M = \frac{D_{M_{Actual}}}{D_{M_{Potential}}}$$

$$D_{M_{Potential}} = \frac{\sum_{i=1}^{n_{[I|MM]}} D_{[I|MM]}(i)}{n_{[I|MM]}} + \frac{\sum_{i=1}^{n_{[C]}} D_{[C]}(i)}{n_{[C]}}$$

$I$ = Image, $MM$ = MultiMedia, $C$ = CSS
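
A small worked example may help contrast the two measures. The Python sketch below follows the formulas on this slide; the per-resource damage values and the actual-damage figure are made-up inputs, since the slide does not say how each D(i) or the actual damage is derived.

```python
def percent_missing(missing_count, total_count):
    """Percent missing: missing embedded resources over total embedded resources."""
    return missing_count / total_count


def potential_damage(image_mm_damage, css_damage):
    """Potential damage: mean per-resource damage over images/multimedia
    plus mean per-resource damage over CSS."""
    def mean(values):
        return sum(values) / len(values) if values else 0.0
    return mean(image_mm_damage) + mean(css_damage)


def damage_rating(actual_damage, potential):
    """Damage rating: actual damage over potential damage."""
    return actual_damage / potential


# Hypothetical page: 3 images/multimedia and 2 style sheets, 1 resource missing.
image_mm = [0.5, 0.25, 0.75]   # assumed per-resource damage values
css = [0.25, 0.75]             # assumed per-resource damage values

m_m = percent_missing(1, 5)
potential = potential_damage(image_mm, css)
d_m = damage_rating(0.25, potential)  # 0.25 is an assumed actual-damage value
```

With these inputs the percent missing and the damage rating come out differently, which is the point of weighting resources by importance rather than counting them.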

Page 114: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Damage in the Internet Archive

• Measured Internet Archive mementos

• Damage generally improves over time

• Despite missing more resources over time

Page 115: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Expanding the crawl frontier


Click events lead to the most descendants

Page 116: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Related Work


Page 117: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Deep Web

• Deferred=Deep (Bergman, 2001)

• Mobile requires context (Schneider, 2013)

• Static → Dynamic Web (Rosenthal, 2011)(IIPC, 2012)

• Crawlers & deep Web (Ast, 2008) (B. He, 2007) (Y. He, 2013)

• Google’s deep Web crawler (Madhavan, 2008)

• Forms (Ntoulas, 2005)


Page 118: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Archive Quality

• SHARC, Quality Conscious Archiving (Spaniol, 2009)

• Quality of archives (Spaniol, 2009, 2009)

• Archiveready (Banos, 2013, 2015)

• Acid test (Kelly, 2014)

• Block Importance (Ye, 2003) (Fersini, 2008) (Kohlschutter, 2010)


Page 119: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Monitoring for Security

• Ripley (Vikram, 2009)

• Mugshot (Mickens, 2010)

• ActionShot (Li, 2010)

• Ajax testing and states (Mesbah, 2007, 2008, 2009, 2009, 2012)

• Crawling Ajax (Dincturk, 2013, 2014)


Page 120: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Publications

Master’s:

• Kyle Dempsey, Justin Brunelle, G. Tanner Jackson, Chutima Boonthum, Irwin Levinstein, Danielle McNamara. “MiBoard: Multiplayer Interactive Board Game”, AIED2009

• Justin F. Brunelle, Irwin B. Levinstein, Chutima Boonthum. “MiBoard: Metacognitive Training Through Gaming in iSTART”, 2009 VMASC Capstone Conference

• Best paper in track

• Justin F. Brunelle, Kyle B Dempsey, G. Tanner Jackson, Chutima Boonthum, Irwin B. Levinstein, Danielle S. McNamara. “MiBoard: Metacognitive Training Through Gaming”, SCiP2009

• Justin F. Brunelle, G. Tanner Jackson, Kyle Dempsey, Chutima Boonthum, Irwin B. Levinstein, Danielle S. McNamara. “Analysis of MiBoard as an iSTART Practice Tool”, FLAIRS-24, 2010

• Kyle Dempsey, G. Tanner Jackson, Justin Brunelle, Michael Rowe, Danielle McNamara. “MiBoard: Assessing Collaborative Learning Through Game-Based Practice”, FLAIRS-24, 2010

PhD:

• Justin F. Brunelle “Filling in the Blanks: Capturing the Dynamic Web”, JCDL 2012 Doctoral Consortium

• Justin F. Brunelle, Michael L. Nelson “An Evaluation of Caching Policies for Memento TimeMaps”, JCDL 2013

• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “On the Change in Archivability of Websites Over Time”, TPDL 2013

• Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, “Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool”, TPDL 2013

• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 19(11/12), 2013.

• Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle, and Michael L. Nelson “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014

• Best Student Paper, International Journal of Digital Libraries: JCDL2015 Special Issue

• Justin F. Brunelle, Mat Kelly, Michele C. Weigle, and Michael L. Nelson “The impact of JavaScript on archivability”, 2015, IJDL

• Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, and Michael L. Nelson “Mobile Mink: Merging Mobile and Desktop Archived Webs”, JCDL 2015

• Best Poster

• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015

• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016

• Justin F. Brunelle, Krista Ferrante, Eliot Wilczek, Michele C. Weigle, and Michael L. Nelson, “Leveraging Heritrix and the Wayback Machine on a corporate intranet: A case study on improving corporate archives”, D-Lib Magazine, 2016

Page 121: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Performance with classifier


Page 122: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Mobile Sites in the Archives

http://m.espn.go.com/wireless/ http://espn.go.com/

“A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 2013

Page 123: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Mobile Sites in the Archives

http://m.espn.go.com/wireless/ http://espn.go.com/

URI-M:

http://web.archive.org/web/20140330125315/http://espn.go.com/

URI-M:

http://web.archive.org/web/20140330125414/http://m.espn.go.com/wireless/

Page 124: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Collisions in the Archives


http://www.cnn.com/

URI-M? URI-T?

http://web.archive.org/web/[DATETIME]/http://www.cnn.com/

Page 125: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Need a better way to index mementos

• URI-R is no longer enough

• Environmental factors:

‒ Content negotiation

‒ Interaction

‒ Personalization

‒ GeoIP
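
One way to picture an index that goes beyond the URI-R is a composite key that also records the environmental factors listed above. The Python sketch below is purely hypothetical: the `MementoKey` class, its field names, and the example values are invented for illustration and do not describe any existing archive's index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MementoKey:
    """Hypothetical index key: the URI-R plus the environment of the capture."""
    uri_r: str
    user_agent_class: str = "desktop"   # content negotiation
    interactions: tuple = ()            # client-side events fired before capture
    personalization: str = ""           # e.g. a logged-in profile label
    geo: str = ""                       # GeoIP region of the crawler


index = {}
index[MementoKey("http://www.cnn.com/")] = "desktop capture"
index[MementoKey("http://www.cnn.com/", user_agent_class="mobile")] = "mobile capture"
index[MementoKey("http://www.cnn.com/", interactions=("click",))] = "capture after a click"
```

Keyed on the URI-R alone, these three captures would collide under a single entry; with the extra fields they stay distinct.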


Page 126: Scripts in a Frame: A Two-Tiered Crawling Approach to Archiving Deferred Representations

Content Negotiation

Server-side interpretation of client-provided parameters

Multiple representations, single resource

[Figure: a URI identifies a Resource; Representation 1 and Representation 2 each represent it. Content negotiation on the user-agent selects the Mobile or Desktop representation.]
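
Server-side negotiation on the user-agent can be sketched in a few lines of Python. The token list, the `negotiate` function, and the sample user-agent strings are assumptions for illustration; real servers use far more elaborate detection.

```python
# Assumed substrings that mark a mobile browser; not an exhaustive list.
MOBILE_TOKENS = ("Mobile", "Android", "iPhone")


def negotiate(user_agent, representations):
    """Pick one representation of a single resource from the client's User-Agent."""
    is_mobile = any(token in user_agent for token in MOBILE_TOKENS)
    kind = "mobile" if is_mobile else "desktop"
    return kind, representations[kind]


# One resource (one URI), two representations.
representations = {
    "desktop": "<html><!-- desktop layout --></html>",
    "mobile": "<html><!-- mobile layout --></html>",
}

crawler_choice = negotiate("curl/7.19.7 (x86_64-redhat-linux-gnu)", representations)
phone_choice = negotiate(
    "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X)", representations
)
```

A crawler that always sends a desktop user-agent string only ever receives the desktop representation, which is why mobile representations are underrepresented in the archives.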