web archiving tools and technologies
for the Web Archiving workshop at IS&T Archiving 2013 in Washington, DC
Web Archiving Tools and Technology
Dan Chudnov - GWU Libraries
dchud at gwu edu
@dchud
IS&T Workshop, April 2, 2013
Washington DC USA
Tuesday, April 2, 13
select scope crawl process access
unt nom tool X X
heritrix X X
wct X X X X
netarchivesuite X X X X X
warc tools X
nutchwax X X
wayback X
select
• what to collect
• who authorizes
• when
• what order
scope
• how much
• robots.txt
• what to leave out
• which doors not to open
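The scoping questions above can be sketched as a single yes/no check per URL. This is a minimal illustration, not how any of the tools in this deck implement scope: the host allowlist, the exclusion patterns, the `demo-crawler` agent name, and the function names are all hypothetical. It uses Python's standard `urllib.robotparser` to honor robots.txt.

```python
import re
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical scope rules, for illustration only.
ALLOWED_HOSTS = {"example.org", "www.example.org"}          # how much
EXCLUDE_PATTERNS = [re.compile(r"/calendar/"),              # what to leave out
                    re.compile(r"[?&]sessionid=")]


def build_robots(robots_txt):
    """Parse robots.txt text that the caller has already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def in_scope(url, robots, agent="demo-crawler"):
    """Return True only if the URL passes host, pattern, and robots.txt checks."""
    if urlsplit(url).hostname not in ALLOWED_HOSTS:
        return False
    if any(pattern.search(url) for pattern in EXCLUDE_PATTERNS):
        return False
    return robots.can_fetch(agent, url)


robots = build_robots("User-agent: *\nDisallow: /private/\n")
print(in_scope("http://example.org/about", robots))
print(in_scope("http://example.org/private/report", robots))
```

Keeping the rules as plain data makes it easier for selectors and crawl operators to review and agree on them before a crawl starts.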
crawl
• start with seeds
• find, queue, follow links
• be kind to each site
• parallelize across sites
• schedule, log, checkpoint, resume
• bundle
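A toy version of the crawl loop above, to make the moving parts concrete. It is a sketch under stated assumptions, not Heritrix's design: `fetch` is a caller-supplied function (here a dictionary lookup, so nothing touches the network), politeness is modeled with a simulated clock rather than real sleeping, and parallelism, checkpointing, and bundling are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, min_delay=1.0, max_pages=100):
    """Breadth-first crawl; returns a log of (simulated_time, url) pairs.

    fetch(url) must return HTML text, or None if the page is unavailable.
    """
    frontier = deque(seeds)          # queue of URLs still to visit
    seen = set(seeds)                # never queue the same URL twice
    next_allowed = {}                # host -> earliest time for next request
    clock = 0.0
    log = []
    while frontier and len(log) < max_pages:
        url = frontier.popleft()
        host = urlsplit(url).hostname
        # Be kind: wait until this host may be contacted again.
        # A real crawler would sleep or work on another host meanwhile.
        clock = max(clock, next_allowed.get(host, 0.0))
        next_allowed[host] = clock + min_delay
        html = fetch(url)
        log.append((clock, url))
        if html is None:
            continue
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return log


# Hypothetical in-memory "web" standing in for real HTTP requests.
PAGES = {
    "http://a.example/": '<a href="/one">one</a> <a href="http://b.example/">b</a>',
    "http://a.example/one": '<a href="/">home</a>',
    "http://b.example/": "",
}
crawl_log = crawl(["http://a.example/"], PAGES.get)
for when, url in crawl_log:
    print(when, url)
```

In a real crawl, the scope check would run before a link is queued, and the log would be persisted so that a stopped crawl can resume.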
process
• lump, split, bundle, rebundle
• quality control
• index, surrogate, reorder, prep for access
• store, distribute, preserve
access
• browse
• search
• known items
• patterns
• needles
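For the "known items" case, access usually comes down to: given a URL and a date, find the nearest capture. A minimal sketch of that lookup follows, assuming a hypothetical in-memory index that maps each URL to a sorted list of 14-digit capture timestamps (YYYYMMDDhhmmss). Real access tools keep this kind of index on disk; the names here are illustrative only.

```python
from bisect import bisect_left
from datetime import datetime


def parse_ts(ts):
    """Turn a 14-digit timestamp string into a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest_capture(index, url, wanted):
    """Return the capture timestamp nearest to `wanted`, or None if no captures.

    index maps url -> list of timestamps sorted ascending. Fixed-width digit
    strings sort chronologically, so bisect works on them directly.
    """
    captures = index.get(url)
    if not captures:
        return None
    pos = bisect_left(captures, wanted)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = captures[max(pos - 1, 0):pos + 1]
    target = parse_ts(wanted)
    return min(candidates, key=lambda ts: abs(parse_ts(ts) - target))


# Hypothetical index contents.
INDEX = {
    "http://example.org/": ["20110101000000", "20120615120000", "20130402090000"],
}
print(closest_capture(INDEX, "http://example.org/", "20130101000000"))
```

Browsing and full-text search need more than this; the lookup covers only the case where the user already knows the URL.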
UNT URL Nomination Tool
• collaborative selection
• collect seed lists
• attach metadata
• agree on scope
• feed crawlers
heritrix
• free software from Internet Archive
• easy to start with
• difficult to master
• powerful, configurable, confusing
heritrix cont’d
• two major versions, “1” and “3”
• WCT and NetarchiveSuite embed “1”
• “1” - minimal UI
• “3” - even less
• iterate early - long learning curve
• best available tool
Web Curator Tool
• free software from NLNZ / BL
• full crawling workflow suite
• select, obtain permissions, authorize
• schedule, crawl w/heritrix 1
WCT cont’d
• quality review
• statistics, hierarchy visualization, pruning
• troubleshooting
• task notifications
• reporting
NetarchiveSuite
• free software from netarkivet.dk
• used by State and University Library, The Royal Library in Denmark
• complete solution from selection to access
NetarchiveSuite cont’d
• selection, scoping, scheduling
• crawling, troubleshooting, tweaking
• system dashboard, quality assurance
• heritrix and wayback
warc-tools
• command-line tools for arc/warc
• validate, summarize, filter
• bundle / rebundle, convert, index
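To show what "summarize" and "filter" mean at the record level, here is a small sketch that reads uncompressed WARC records and counts them by type. It is not the warc-tools API or command line. The helper names are hypothetical, and `make_record` writes deliberately simplified records for the demo: a valid WARC record also needs fields such as WARC-Record-ID and WARC-Date, and real files are usually gzip-compressed.

```python
import io
from collections import Counter


def make_record(warc_type, uri, payload):
    """Build a simplified, demo-only WARC record as bytes."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {warc_type}",
        f"WARC-Target-URI: {uri}",
        f"Content-Length: {len(payload)}",
    ]
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + payload + b"\r\n\r\n"


def iter_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                       # end of file
        if not line.strip():
            continue                     # blank lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break                    # blank line ends the header block
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def summarize(stream):
    """Count records by WARC-Type."""
    return Counter(headers.get("WARC-Type") for headers, _ in iter_records(stream))


def filter_by_type(stream, warc_type):
    """Return target URIs of records with the given WARC-Type."""
    return [h.get("WARC-Target-URI") for h, _ in iter_records(stream)
            if h.get("WARC-Type") == warc_type]


DATA = (
    make_record("request", "http://example.org/", b"GET / HTTP/1.1\r\n\r\n")
    + make_record("response", "http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nhello")
    + make_record("response", "http://example.org/a", b"HTTP/1.1 200 OK\r\n\r\nbye")
)
print(summarize(io.BytesIO(DATA)))
print(filter_by_type(io.BytesIO(DATA), "response"))
```

Reading the payload by Content-Length rather than by line is the important detail: payloads can contain blank lines and header-like text of their own.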
NutchWax
• free software
• index / search of ARC data
• development slowed / stopped but still used
searching web archives is hard
wayback
• free software from Internet Archive
• public web access to web archives
• what you’ve seen at archive.org
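Replay links at archive.org follow a recognizable shape: a prefix, a capture timestamp, then the original URL. The helper below builds such links. It is an illustrative sketch with a hypothetical function name; a locally installed Wayback would use its own prefix, which is why the prefix is a parameter.

```python
import re

WAYBACK_PREFIX = "http://web.archive.org/web"


def replay_url(original_url, timestamp="", prefix=WAYBACK_PREFIX):
    """Build a Wayback-style link for a URL.

    timestamp is 4 to 14 digits (YYYY up to YYYYMMDDhhmmss); when it is
    omitted, "*" is used, which asks for the list of all captures.
    """
    if timestamp and not re.fullmatch(r"\d{4,14}", timestamp):
        raise ValueError(f"timestamp must be 4 to 14 digits, got {timestamp!r}")
    return f"{prefix}/{timestamp or '*'}/{original_url}"


print(replay_url("http://example.org/", "20130402000000"))
print(replay_url("http://example.org/"))
```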