web archiving tools and technologies

Web Archiving Tools and Technology Dan Chudnov - GWU Libraries dchud at gwu edu @dchud IS&T Workshop, April 2, 2013 Washington DC USA Tuesday, April 2, 13

Upload: dan-chudnov

Post on 20-May-2015


DESCRIPTION

for the Web Archiving workshop at IS&T Archiving 2013 in Washington, DC

TRANSCRIPT

Page 1: web archiving tools and technologies

Web Archiving Tools and Technology

Dan Chudnov - GWU Libraries

dchud at gwu edu

@dchud

IS&T Workshop, April 2, 2013

Washington DC USA

Tuesday, April 2, 13

Page 2: web archiving tools and technologies

select scope crawl process access

unt nom tool X X

heritrix X X

wct X X X X

netarchivesuite X X X X X

warc tools X

nutchwax X X

wayback X


Page 3: web archiving tools and technologies

select

• what to collect

• who authorizes

• when

• what order


Page 4: web archiving tools and technologies

scope

• how much

• robots.txt

• what to leave out

• which doors not to open
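The scoping questions above can be sketched as a single URL filter. This is only an illustration: the host list, the excluded path prefix, the user-agent string, and the robots.txt text are all hypothetical, and a real crawler would fetch robots.txt from each site rather than hard-code it.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical scope rules: hosts agreed on for collection, path prefixes to skip.
ALLOWED_HOSTS = {"example.org"}
EXCLUDED_PREFIXES = ("/calendar/",)

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def in_scope(url: str, user_agent: str = "example-archiver") -> bool:
    """Return True if the URL passes the host, prefix, and robots.txt checks."""
    parts = urlsplit(url)
    if parts.hostname not in ALLOWED_HOSTS:
        return False  # how much: stay on the agreed hosts
    if parts.path.startswith(EXCLUDED_PREFIXES):
        return False  # what to leave out: e.g. endless calendar pages
    return robots.can_fetch(user_agent, url)  # which doors not to open


if __name__ == "__main__":
    for candidate in ("http://example.org/about", "http://example.org/private/x"):
        print(candidate, in_scope(candidate))
```

Whether to honor robots.txt at all is a policy decision for the archive; the sketch simply shows where that check would sit.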


Page 5: web archiving tools and technologies

crawl

• start with seeds

• find, queue, follow links

• be kind to each site

• parallelize across sites

• schedule, log, checkpoint, resume

• bundle
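The crawl loop above can be sketched with a queue and a per-host schedule. This is a toy model, not how Heritrix works internally: `FAKE_WEB` is a hypothetical in-memory link graph standing in for real fetching and link extraction, and the politeness delay is simulated as a timestamp rather than an actual wait.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://b.example/news"],
    "http://b.example/news": [],
}


def crawl(seeds, max_hops=2, delay=1.0):
    """Breadth-first crawl returning (url, scheduled_time) pairs in fetch order."""
    frontier = deque((seed, 0) for seed in seeds)  # start with seeds
    seen = set(seeds)
    next_allowed = {}  # host -> earliest simulated time for its next request
    log = []
    while frontier:
        url, hops = frontier.popleft()
        host = urlsplit(url).hostname
        when = next_allowed.get(host, 0.0)
        next_allowed[host] = when + delay  # be kind: space out hits per host
        log.append((url, when))
        if hops >= max_hops:
            continue  # scope limit: do not follow links any further
        for link in FAKE_WEB.get(url, []):  # find, queue, follow links
            if link not in seen:
                seen.add(link)
                frontier.append((link, hops + 1))
    return log


if __name__ == "__main__":
    for url, when in crawl(["http://a.example/"]):
        print(f"t={when:.1f}s  {url}")
```

Because the delay is tracked per host, requests to different hosts can share the same scheduled time, which is the opening for parallelizing across sites. Checkpointing would amount to saving `frontier`, `seen`, and `next_allowed` so a crawl can resume.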


Page 6: web archiving tools and technologies

process

• lump, split, bundle, rebundle

• quality control

• index, surrogate, reorder, prep for access

• store, distribute, preserve


Page 7: web archiving tools and technologies

access

• browse

• search

• known items

• patterns

• needles


Page 8: web archiving tools and technologies

select scope crawl process access

unt nom tool X X

heritrix X X

wct X X X X

netarchivesuite X X X X X

warc tools X

nutchwax X X

wayback X


Page 9: web archiving tools and technologies

UNT URL Nomination Tool

• collaborative selection

• collect seed lists

• attach metadata

• agree on scope

• feed crawlers


Page 10: web archiving tools and technologies

heritrix

• free software from Internet Archive

• easy to start with

• difficult to master

• powerful, configurable, confusing


Page 11: web archiving tools and technologies

heritrix cont’d

• two major versions, “1” and “3”

• WCT and NetarchiveSuite embed “1”

• “1” - minimal UI

• “3” - even less

• iterate early - long learning curve

• best available tool


Page 12: web archiving tools and technologies

heritrix cont’d


Page 13: web archiving tools and technologies

Web Curator Tool

• free software from NLNZ / BL

• full crawling workflow suite

• select, obtain permissions, authorize

• schedule, crawl w/heritrix 1


Page 14: web archiving tools and technologies

WCT cont’d

• quality review

• statistics, hierarchy visualization, pruning

• troubleshooting

• task notifications

• reporting


Page 15: web archiving tools and technologies

WCT cont’d


Page 16: web archiving tools and technologies

NetarchiveSuite

• free software from netarkivet.dk

• used by the State and University Library and the Royal Library in Denmark

• complete solution from selection to access


Page 17: web archiving tools and technologies

NetarchiveSuite cont’d


Page 18: web archiving tools and technologies

NetarchiveSuite cont’d

• selection, scoping, scheduling

• crawling, troubleshooting, tweaking

• system dashboard, quality assurance

• heritrix and wayback


Page 19: web archiving tools and technologies

warc-tools

• command-line tools for arc/warc

• validate, summarize, filter

• bundle / rebundle, convert, index
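To give a feel for what "summarize" means for WARC data, here is a minimal sketch that reads an uncompressed WARC stream and counts records by type. It is not the warc-tools API. The `make_record` helper is hypothetical and writes simplified records that omit headers a real WARC file requires (such as WARC-Record-ID and WARC-Date), and real files are usually gzip-compressed.

```python
import io
from collections import Counter


def make_record(record_type: str, uri: str, payload: bytes) -> bytes:
    """Build a simplified WARC record (real records carry more headers)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


def read_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"not a WARC record: {line!r}")
        headers = {}
        for raw in iter(stream.readline, b"\r\n"):
            if not raw:
                raise ValueError("truncated record headers")
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Read the block by its declared length; it may contain blank lines.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def summarize(stream):
    """Count records by WARC-Type."""
    return Counter(headers["WARC-Type"] for headers, _ in read_records(stream))


if __name__ == "__main__":
    sample = make_record(
        "request", "http://example.org/", b"GET / HTTP/1.1\r\n\r\n"
    ) + make_record(
        "response", "http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nhello"
    )
    print(summarize(io.BytesIO(sample)))
```

Filtering and rebundling follow the same pattern: iterate over records, keep the ones that match, and write them back out.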


Page 20: web archiving tools and technologies

NutchWax

• free software

• index / search of ARC data

• development slowed / stopped but still used
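As a rough illustration of indexing and searching archived content, the sketch below builds a tiny inverted index over extracted page text. It is not how NutchWAX is implemented; the `CAPTURES` data and function names are hypothetical. Note that each entry is keyed by URL and capture time, since an archive holds many versions of the same page.

```python
import re
from collections import defaultdict

# Hypothetical extracted text keyed by (url, capture timestamp).
CAPTURES = {
    ("http://example.org/", "20130101000000"): "Library hours and news",
    ("http://example.org/", "20130402000000"): "Library hours changed for spring",
    ("http://example.org/about", "20130402000000"): "About the library",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(captures):
    """Map each term to the set of captures whose text contains it."""
    index = defaultdict(set)
    for key, text in captures.items():
        for term in tokenize(text):
            index[term].add(key)
    return index


def search(index, query):
    """Return captures containing every query term, sorted by URL then time."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


if __name__ == "__main__":
    idx = build_index(CAPTURES)
    for url, timestamp in search(idx, "library hours"):
        print(timestamp, url)
```

A query that matches one page can return several captures of it, which hints at why ranking and deduplication get complicated at archive scale.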


Page 21: web archiving tools and technologies

searching web archives is hard


Page 22: web archiving tools and technologies

wayback

• free software from Internet Archive

• public web access to web archives

• what you’ve seen at archive.org
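Replay of this kind depends on finding the capture closest to a requested date. The sketch below shows one way to do that lookup over a sorted list of 14-digit timestamps; it is an assumption-laden illustration, not Wayback's own code. The `CAPTURES` list and function names are hypothetical, and `replay_path` only mimics the `/web/<timestamp>/<url>` style seen in archive.org addresses.

```python
from bisect import bisect_left
from datetime import datetime

FMT = "%Y%m%d%H%M%S"  # 14-digit timestamps, e.g. 20130402120000

# Hypothetical capture timestamps for one URL, sorted ascending.
CAPTURES = ["20120115080000", "20130101000000", "20130402120000"]


def closest_capture(captures, wanted):
    """Return the capture timestamp nearest in time to `wanted`."""
    if not captures:
        raise ValueError("no captures for this URL")
    i = bisect_left(captures, wanted)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = captures[max(i - 1, 0): i + 1]
    target = datetime.strptime(wanted, FMT)
    return min(candidates, key=lambda ts: abs(datetime.strptime(ts, FMT) - target))


def replay_path(timestamp, url):
    """Build a path in the /web/<timestamp>/<url> style."""
    return f"/web/{timestamp}/{url}"


if __name__ == "__main__":
    best = closest_capture(CAPTURES, "20130301000000")
    print(replay_path(best, "http://example.org/"))
```

Fixed-width timestamps sort the same way as strings and as dates, which is what lets a plain binary search work here.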


Page 23: web archiving tools and technologies

wayback cont’d


Page 24: web archiving tools and technologies

wayback cont’d


Page 25: web archiving tools and technologies
