web archiving tools and technologies

Post on 20-May-2015


DESCRIPTION

for the Web Archiving workshop at IS&T Archiving 2013 in Washington, DC

TRANSCRIPT

Web Archiving Tools and Technology

Dan Chudnov - GWU Libraries

dchud at gwu edu

@dchud

IS&T Workshop, April 2, 2013

Washington DC USA

Tuesday, April 2, 13

select scope crawl process access

unt nom tool X X

heritrix X X

wct X X X X

netarchivesuite X X X X X

warc tools X

nutchwax X X

wayback X

select

• what to collect

• who authorizes

• when

• what order

scope

• how much

• robots.txt

• what to leave out

• which doors not to open
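
One way to respect robots.txt while scoping is to test each candidate URL against a site's rules before queueing it. The sketch below uses Python's standard `urllib.robotparser`; the rules, the agent name `example-archiver`, and the URLs are all made up for illustration, and a real crawler would fetch robots.txt from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def in_scope(url, robots_txt=ROBOTS_TXT, agent="example-archiver"):
    """Return True if the robots.txt rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


allowed = in_scope("http://example.org/about.html")
blocked = in_scope("http://example.org/private/report.html")
```

Whether an archive obeys, ignores, or selectively overrides robots.txt is a policy decision; the check itself is the easy part.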

crawl

• start with seeds

• find, queue, follow links

• be kind to each site

• parallelize across sites

• schedule, log, checkpoint, resume

• bundle
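
The crawl loop above can be sketched in a few lines. This is a toy simulation, not how heritrix or any other crawler is implemented: `FAKE_WEB` stands in for the live web, nothing is fetched over the network, and politeness is modeled by assigning each request a simulated time slot per host instead of sleeping. All names here are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Hypothetical stand-in for the live web: URL -> HTML body.
FAKE_WEB = {
    "http://a.example/": '<a href="/one">1</a> <a href="http://b.example/">b</a>',
    "http://a.example/one": '<a href="/">home</a>',
    "http://b.example/": '<a href="http://a.example/one">back</a>',
}


class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def crawl(seeds, fetch=FAKE_WEB.get, delay=1.0):
    """Breadth-first crawl; returns (visit order, simulated schedule)."""
    frontier = deque(seeds)      # queue of URLs waiting to be fetched
    seen = set(seeds)            # never queue the same URL twice
    next_slot = {}               # host -> earliest simulated request time
    order, schedule = [], []
    while frontier:
        url = frontier.popleft()
        host = urlsplit(url).netloc
        when = next_slot.get(host, 0.0)
        next_slot[host] = when + delay   # be kind: space requests per host
        schedule.append((when, url))
        body = fetch(url)
        if body is None:
            continue
        order.append(url)
        parser = LinkParser()
        parser.feed(body)
        for href in parser.links:        # find, queue, follow links
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order, schedule


order, schedule = crawl(["http://a.example/"])
```

Because the delay is tracked per host, requests to different sites can share a time slot, which is the idea behind parallelizing across sites while staying gentle on each one.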

process

• lump, split, bundle, rebundle

• quality control

• index, surrogate, reorder, prep for access

• store, distribute, preserve

access

• browse

• search

• known items

• patterns

• needles

UNT URL Nomination Tool

• collaborative selection

• collect seed lists

• attach metadata

• agree on scope

• feed crawlers

heritrix

• free software from Internet Archive

• easy to start with

• difficult to master

• powerful, configurable, confusing

heritrix cont’d

• two major versions, “1” and “3”

• WCT and NetarchiveSuite embed “1”

• “1” - minimal UI

• “3” - even less

• iterate early - long learning curve

• best available tool

Web Curator Tool

• free software from NLNZ / BL

• full crawling workflow suite

• select, obtain permissions, authorize

• schedule, crawl w/heritrix 1

WCT cont’d

• quality review

• statistics, hierarchy visualization, pruning

• troubleshooting

• task notifications

• reporting

NetarchiveSuite

• free software from netarkivet.dk

• used by the State and University Library and The Royal Library in Denmark

• complete solution from selection to access

NetarchiveSuite cont’d

• selection, scoping, scheduling

• crawling, troubleshooting, tweaking

• system dashboard, quality assurance

• heritrix and wayback

warc-tools

• command-line tools for arc/warc

• validate, summarize, filter

• bundle / rebundle, convert, index
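
To give a feel for what summarizing and filtering involve, here is a minimal sketch that reads uncompressed WARC-style records with the Python standard library only. It is not the warc-tools API, and `make_record` builds deliberately stripped-down records (real WARC records carry more required headers, such as a record ID and date, and are usually gzip-compressed), so treat every name here as illustrative.

```python
import io
from collections import Counter


def make_record(warc_type, uri, payload):
    """Build one minimal, uncompressed WARC-style record (illustrative only)."""
    headers = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return headers + payload + b"\r\n\r\n"


def iter_records(stream):
    """Yield (headers dict, payload bytes) for each record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return                       # end of file
        if not line.strip():
            continue                     # blank separator between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        for raw in iter(stream.readline, b"\r\n"):
            if not raw:
                raise ValueError("truncated record headers")
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Read the payload by length, since it may itself contain blank lines.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


data = (
    make_record("request", "http://a.example/", b"GET / HTTP/1.1\r\n\r\n")
    + make_record("response", "http://a.example/", b"HTTP/1.1 200 OK\r\n\r\nhello")
    + make_record("response", "http://a.example/one", b"HTTP/1.1 200 OK\r\n\r\none")
)
records = list(iter_records(io.BytesIO(data)))

# Summarize: count records by type. Filter: keep only response records.
summary = Counter(h["WARC-Type"] for h, _ in records)
responses = [h["WARC-Target-URI"] for h, _ in records if h["WARC-Type"] == "response"]
```

For production data, a maintained WARC library or the command-line tools themselves are the safer choice; the point here is only that each record is a small header block followed by a length-delimited payload.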

NutchWax

• free software

• index / search of ARC data

• development slowed / stopped but still used

searching web archives is hard

wayback

• free software from Internet Archive

• public web access to web archives

• what you’ve seen at archive.org
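
Replay URLs at archive.org combine a 14-digit capture timestamp with the original URL, which makes known-item access easy to script. A small sketch, assuming that URL pattern; the helper name `replay_url` and the example address are hypothetical, and a local Wayback installation would use its own prefix.

```python
from datetime import datetime


def replay_url(original_url, when, prefix="https://web.archive.org/web/"):
    """Build a Wayback-style replay URL for a capture near `when`."""
    stamp = when.strftime("%Y%m%d%H%M%S")   # 14 digits: YYYYMMDDhhmmss
    return f"{prefix}{stamp}/{original_url}"


url = replay_url("http://example.org/", datetime(2013, 4, 2, 12, 0, 0))
```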
