

Page 1: SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program on Information Science  talk by JEFFERSON BAILEY

Jefferson Bailey, Director, Web Archiving, Internet Archive
MIT Program on IS, Brownbag Series | Cambridge MA 2017 | @jefferson_bail | [email protected]

SAFETY NETS: RESCUE & REVIVAL FOR ENDANGERED WEB RECORDS


Outline

● About Internet Archive

● Timeline of Web Archiving at IA

● Web as Historical & Archival Record

● Collecting & Collections

● Technologies & Challenges

● Research & Services

● Conclusion

The Internet Archive

Non-Profit Library

Founded in 1996 by Brewster Kahle

Universal Access to All Knowledge

35,000,000,000,000,000 Bytes Archived (35 Petabytes)


Books Digitization


Music Digitization


TV Collection


Software Preservation and Emulation

25,000 Software Titles
2,000,000 Moving Images
2,300,000 Book Archive
2,400,000 Audio Recordings
3,000,000 Hours of Television
4,000,000 eBooks
570,000,000,000+ URLs


20 Years of Archiving the Web

1996 US Presidential Campaigns with Smithsonian
218,342,520 Web Captures

1997 First Full Crawl
525,362,846 Web Captures

1998 Donation of Crawl to the Library of Congress
1,166,891,826 Web Captures

2000 US Presidential Campaigns with the Library of Congress | Started Collecting Television | Started Digitizing Movies
6,153,042,235 Web Captures

2001 Launch of the Wayback Machine
12,082,859,018 Web Captures

2002 Book Digitization Begins
22,277,788,816 Web Captures

2003 International Internet Preservation Consortium Founded
38,868,116,181 Web Captures

2006 Archive-It Started
103,943,903,726 Web Captures

2007 Ireland
184,277,909,308 Web Captures

2008 National Archives & Records Administration (NARA) Congressional Harvests (https://webharvest.gov)
209,160,715,829 Web Captures

2009 Archive-It Adds its 100th Partner | 7 National Library Partners
225,658,093,516 Web Captures

2010 Broad and Survey Web-Scale Crawls
246,744,306,660 Web Captures

2014 Emulation of Software Archive in the Browser
452,769,236,649 Web Captures

2016 Archive-It Adds its 500th Partner
467,195,419,069 Web Captures


The Web as Historical Resource


https://web.archive.org/details/http://web.mit.edu/


The Web as Historical Resource

● The web is the primary publishing platform of our generation

● The web has consumed all media

● The web is distributed in access, but centralized in publication

● One cannot study contemporary society (or even recent history) without studying the web

The Web as Archival Resource

● WARC format
○ Obtuse, technical
● Packaging
○ URL-centric clicking & browsing through Wayback
○ Query-based retrieval and search functionality
○ Silos
● Born-Digital is same
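A minimal sketch of why the WARC format reads as technical: each record is a block of named headers followed by a payload whose size is given by Content-Length. The sample record below is hand-written for illustration and not taken from a real crawl. The parser handles a single uncompressed record only, whereas real WARC files are typically gzip-compressed and hold many records.

```python
from typing import Dict, Tuple


def parse_warc_record(raw: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Split one uncompressed WARC record into (version, headers, payload)."""
    # The header block ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the exact payload size in bytes.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# Hand-written sample: an HTTP response wrapped in a WARC "response" record.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://web.mit.edu/\r\n"
    b"WARC-Date: 2017-01-01T00:00:00Z\r\n"
    b"Content-Length: " + str(len(payload)).encode() + b"\r\n"
    b"\r\n" + payload + b"\r\n\r\n"
)

version, headers, body = parse_warc_record(sample)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], len(body))
```

Note that the payload is itself a full HTTP response with its own headers, which is part of what makes the format awkward to browse without tooling such as Wayback.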


The Web as Archival Resource

● The web as “lived experience”

● Plurality of representation

● Diversity of media

● Unrivaled scope, scale, extent

● Access can be native (in the browser, full-text search, etc)

Collecting & Collections

Web Scale

Curated

Collaborative

Global Scale Harvesting

Hundreds of crawls per day | 1 Billion web documents per week | 15 PB total (3 PB / year)

Archive-It

Curated, Selective Web Archiving


Topical or Thematic Collections


Community Webs Program

• LB21 grant, National Digital Platform

• Continuing Education, Curating Collections

• 2-year project (Jun 2017 - May 2019)

• Additional funding from Kahle Austin Foundation

Community Webs Goals

Education & Training
● Establish a cohort network of public librarians doing web archiving to preserve local history
● Support further cohort building, professional development activities, and outreach

Collection Development
● Create open, dedicated training and OER materials on community memory web archiving
● Seed innovative local programming and partnerships

Expanding National Capacity
● Provide web archiving services and infrastructure and ongoing storage and access
● Build an extensible program model that can be scaled and applied to other domains

Community Webs Applications

110 applications from public libraries across the country

A cohort of 28 small, medium, and large public libraries

15 IMLS Participants

13 Kahle Austin Participants

Participants

★ Athens Regional Library System
★ Birmingham Public Library
★ Brooklyn Collection, Brooklyn Public Library
★ Buffalo & Erie County Public Library
★ Cleveland Public Library
★ Columbus Metropolitan Library
★ DC Public Library
★ East Baton Rouge Parish Library
★ Forbes Library (MA)
★ Grand Rapids Public Library
★ Henderson District Public Libraries (HDPL)
★ Kansas City Public Library
★ County of Los Angeles Public Library
★ Marshall Lyon County Library (MN)
★ Metropolitan Library System (OK)
★ New Brunswick Free Public Library
★ Patagonia Library (AZ)
★ Pollard Memorial Library (Lowell, MA)
★ Queens Public Library
★ Lawrence Public Library
★ San Diego Public Library
★ San Francisco Public Library
★ NYPL - Schomburg Center for Research in Black Culture
★ The Urbana Free Library
★ West Hartford Public Library
★ Westborough Public Library
★ Denver Public Library, Western History and African American Research Library

Collecting: With Researchers

News Measures Research Project
● 663 local news sites representing 100 communities
● 7 crawls for a composite week (July - September 2016)
● 2.3 TB & 17 million URLs captured
● Post-project ongoing monthly crawls
● Access to the collection: https://archive-it.org/collections/7520
● Research datasets publicly available: soon! (watch IA blog)
● Work with us to save news for research & posterity!
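A composite (constructed) week, as mentioned for the News Measures crawls, samples one Monday, one Tuesday, and so on from across a longer period, so that seven crawls stand in for a typical week. The sketch below shows one way such dates could be drawn. The slides do not give the project's actual crawl dates or sampling procedure, so the exact date range, the random draw, and the `composite_week` helper are all assumptions.

```python
import random
from datetime import date, timedelta
from typing import Dict, List


def composite_week(start: date, end: date, seed: int = 0) -> List[date]:
    """Pick one random date per weekday (Mon..Sun) from [start, end]."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_weekday: Dict[int, List[date]] = {d: [] for d in range(7)}
    day = start
    while day <= end:
        by_weekday[day.weekday()].append(day)
        day += timedelta(days=1)
    # One draw per weekday gives seven crawl dates spread over the period.
    return sorted(rng.choice(by_weekday[d]) for d in range(7))


# Assumed period: the slide says only "July - September 2016".
crawl_dates = composite_week(date(2016, 7, 1), date(2016, 9, 30))
for d in crawl_dates:
    print(d.isoformat(), d.strftime("%A"))
```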


Collecting: Kids, Scholars, Ourselves

● K-12 Web Archiving Program
● Katrina Blogs
○ WBM as research tool
○ AIT as archival citation
○ Retroactive special collections
○ http://bit.ly/katrina-blogs
● Open-access Scholarship
○ PDFs in WBM (1.6 billion)
○ Cross-referencing against OA registries, repos, ISSNs, DOIs, lists, etc.
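One way to picture the cross-referencing step for open-access scholarship: pull a DOI-like string out of each archived PDF URL and check it against a registry of known open-access DOIs. This is an illustrative sketch only. The regular expression is a loose heuristic, the ".pdf" stripping is an assumption, and the URLs and registry entries are invented; the slides do not describe the actual matching pipeline.

```python
import re
from typing import Iterable, List, Optional, Set, Tuple
from urllib.parse import unquote

# Loose DOI pattern: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s?#]+", re.IGNORECASE)


def extract_doi(url: str) -> Optional[str]:
    """Return a lowercased DOI found in the URL, or None."""
    match = DOI_RE.search(unquote(url))  # decode %2F and similar first
    if match is None:
        return None
    doi = match.group(0).lower()
    if doi.endswith(".pdf"):  # heuristic: drop a trailing file extension
        doi = doi[: -len(".pdf")]
    return doi


def cross_reference(urls: Iterable[str], registry: Set[str]) -> List[Tuple[str, str]]:
    """Return (url, doi) pairs whose DOI appears in the registry."""
    matches = []
    for url in urls:
        doi = extract_doi(url)
        if doi is not None and doi in registry:
            matches.append((url, doi))
    return matches


# Hypothetical inputs for illustration.
archived_pdf_urls = [
    "https://journal.example.org/article/10.1234/abcd.5678.pdf",
    "https://repo.example.edu/files/thesis.pdf",
    "https://example.org/doi/10.5555%2Fxyz-42",
]
oa_registry = {"10.1234/abcd.5678"}

matched = cross_reference(archived_pdf_urls, oa_registry)
print(matched)
```

Only the first URL matches here: the second has no DOI at all, and the third has a DOI that is not in the registry.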

Archiving .gov

The End of Term Web Archive


defining the “government web presence”

Stanford WebBase Project

2004 crawl list of URLs


eot 2016: more partners!

Federal Government Web Archiving Working Group


End of Term Web Archive 2016

2008: 457 from 26 nominators

2012: 1476 from 31 nominators

2016: 15,000+ from 400+ nominators (via UNT form)

Plus!: Over 150,000 from DataRescue/EDGI events/tools


End of Term Web Archive http://eotarchive.cdlib.org/

End of Term Web Archive 2016

Started with:
• 9,000+ social media accounts (scrape of gov SM registry API): 44% FB, 37% TW, 10% YT
• ~190K total domain, subdomains, gov/non-gov sites
• more crowdsourced, curatorial nominations

Ended with:
• 200+ terabytes of data
• 350,000,000+ docs/files
• 70,000,000+ html files
• 40,000,000+ PDFs
• 100,000 public nominations
• LOC, GPO, NARA, GSA, NASA, EPA, others

Special Web Sub Collections

https://archive.org/details/MilitaryIndustrialPowerpointComplex

Every Powerpoint from the .mil web domain (~50K) converted to PDF and with full-text search (FTS)
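A sub-collection like the .mil PowerPoint set starts from a question of the form "which captures under this domain have this MIME type?". The sketch below phrases that question as a query against the Wayback Machine's public CDX index endpoint and parses the JSON shape it returns, where the first row holds the field names. How the collection was actually assembled is not stated in the slides. The host `example.mil` and the sample response row are placeholders, and whether a whole top-level domain can be queried this way is an open assumption.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain: str, mimetype: str, limit: int = 5) -> str:
    """Build a CDX query for captures of one MIME type under a domain."""
    params = [
        ("url", domain),
        ("matchType", "domain"),             # the host and its subdomains
        ("filter", f"mimetype:{mimetype}"),  # keep one MIME type
        ("filter", "statuscode:200"),        # keep successful captures
        ("collapse", "digest"),              # collapse adjacent identical payloads
        ("output", "json"),
        ("limit", str(limit)),
    ]
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def rows_to_dicts(cdx_json: str) -> List[Dict[str, str]]:
    """Turn CDX JSON output (first row = field names) into dicts."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


# Hand-written stand-in for a response; the values are not real captures.
SAMPLE_RESPONSE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["mil,example)/brief.ppt", "20100101000000", "http://example.mil/brief.ppt",
     "application/vnd.ms-powerpoint", "200", "HYPOTHETICALDIGEST", "12345"],
])

query = build_cdx_query("example.mil", "application/vnd.ms-powerpoint")
captures = rows_to_dicts(SAMPLE_RESPONSE)
print(query)
print(captures[0]["timestamp"], captures[0]["original"])
```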


Special Web Sub Collections

http://archive.org/~vinay/20th-century-gov-groupshots.html

Special Web Sub Collections

GifCities! https://gifcities.org | Project done for Internet Archive’s 20th Anniversary, October 2016
Project Team: Vinay Goel, Richard Caceres, Jefferson Bailey + IA archivists!


Special Web Sub Collections

(coming soon!)


Technologies & Challenges

Technical Challenges

● Variations in acquisition
● Complex format
● Tons o’ data +++++++
● Storage infrastructure
● Computational infrastructure
● Diversity of contained data

Conceptual Challenges

● Acquisition & provenance opacity
● The Never-ending Web
● Crawl configurations
● Moving target; volatility
● Traditional finding aids inadequate
● Access is technologically dependent
● Lure of evermore data; more data not “better” data
● Attestation issues and a higher sensitivity to elision

Community Challenges

● Web archiving + born-digital is still a somewhat niche collecting activity
● Lack of coordinated efforts on shared tooling
● Little familiarity with formats, software, or processes
● Few on-ramps for non-developer and developer participation
● Web archives can answer any question

WASAPI: Web Archiving Systems APIs

• “Systems Interoperability and Collaborative Development for Web Archives”
• National Leadership Grant, National Digital Platform, R&D
• IA/AIT (PI), Stanford, UNT, Rutgers
• 2-year project started January 2016
• National Symposium Feb 2017

WASAPI: Web Archiving Systems APIs

Three Key Areas of R&D:

1) What are the attributes of a community model that can support sustainable and broad-based collaborative web archiving technology development?

2) What are the community needs and downstream uses for the planned Export APIs (by AIT & LOCKSS) to facilitate transfer of web archive data between distributed systems and what other prospective APIs does it point to?

3) How can better interoperability of web archiving systems support new forms of access and research use?
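To make the "transfer of web archive data between distributed systems" idea concrete, here is a small sketch of what a receiving system might do with an export listing: compare each file it received against the checksum published by the source. The manifest shape (`filename`, `checksums`, `sha1`) and the `verify_transfer` helper are assumptions made for this illustration and are not presented as the WASAPI schema; the file names and contents are hypothetical.

```python
import hashlib
from typing import Dict, List


def verify_transfer(manifest: List[dict], received: Dict[str, bytes]) -> Dict[str, str]:
    """Compare received file contents against the manifest's SHA-1 values."""
    report: Dict[str, str] = {}
    for entry in manifest:
        name = entry["filename"]
        data = received.get(name)
        if data is None:
            report[name] = "missing"
        elif hashlib.sha1(data).hexdigest() == entry["checksums"]["sha1"]:
            report[name] = "ok"
        else:
            report[name] = "checksum mismatch"
    return report


# Hypothetical export listing and transfer result.
warc_bytes = b"hypothetical WARC file contents"
manifest = [
    {"filename": "crawl-0001.warc.gz",
     "checksums": {"sha1": hashlib.sha1(warc_bytes).hexdigest()}},
    {"filename": "crawl-0002.warc.gz", "checksums": {"sha1": "0" * 40}},
    {"filename": "crawl-0003.warc.gz", "checksums": {"sha1": "0" * 40}},
]
received = {
    "crawl-0001.warc.gz": warc_bytes,
    "crawl-0002.warc.gz": b"corrupted in transit",
}

report = verify_transfer(manifest, received)
for name, status in report.items():
    print(name, status)
```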

Searching: WBM (keywords)

You can now search the

shall we...

https://web-beta.archive.org/

Searching: WBM (keywords)

● How it works:
○ Index is built on anchor text of all in-bound links to a homepage
○ Index text covers 443 million homepages and is drawn from 900B in-bound links from other cross-domain websites
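The keyword search described above can be pictured as an inverted index from anchor-text terms to the homepages they point at, counting only links that come from a different site. The following is a toy sketch of that idea, not the Wayback Machine's implementation. The link triples are invented, hosts are compared as a rough stand-in for "cross-domain", and ranking by raw term counts is an assumption.

```python
import re
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, List, Tuple
from urllib.parse import urlsplit

TOKEN_RE = re.compile(r"[a-z0-9]+")


def host_of(url: str) -> str:
    return urlsplit(url).netloc.lower()


def build_anchor_index(
    links: Iterable[Tuple[str, str, str]]
) -> DefaultDict[str, Counter]:
    """Map each anchor-text term to a count of target hosts it describes."""
    index: DefaultDict[str, Counter] = defaultdict(Counter)
    for source_url, target_url, anchor_text in links:
        src, dst = host_of(source_url), host_of(target_url)
        if src == dst:
            continue  # keep only in-bound links from other hosts
        for term in TOKEN_RE.findall(anchor_text.lower()):
            index[term][dst] += 1
    return index


def search(index: DefaultDict[str, Counter], query: str) -> List[str]:
    """Rank target hosts by how often the query terms appear in their anchors."""
    scores: Counter = Counter()
    for term in TOKEN_RE.findall(query.lower()):
        scores.update(index.get(term, {}))
    return [host for host, _ in scores.most_common()]


# Hypothetical (source page, target homepage, anchor text) triples.
links = [
    ("http://a.example/page", "http://web.mit.edu/", "MIT home page"),
    ("http://b.example/", "http://web.mit.edu/", "Massachusetts Institute of Technology"),
    ("http://web.mit.edu/about", "http://web.mit.edu/", "home"),  # same host, skipped
    ("http://c.example/", "http://archive.example/", "web archive home"),
]

index = build_anchor_index(links)
print(search(index, "mit"))
print(search(index, "institute of technology"))
```

A useful property of this design is that a homepage can be found by words that never appear on the page itself, because the index text comes from how other sites describe it.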

Searching: WBM (profile)

You can now search the details

shall we...

https://web-beta.archive.org/__wb/search/metadata?q=host:mit.edu


BROZZLER!

“browser” | “crawler” = BROZZLER


Research & Services

Advancing New Uses

● Web Archives
○ rich in content
○ rich in longitudinal value
○ rich sources for data mining
● Current Access Models
○ URL-centric clicking & browsing through Wayback
○ Query-based retrieval and search / discovery
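The URL-centric access model comes down to two things: a replay address that combines a timestamp with the original URL, and a lookup for the capture closest to a given date. The sketch below builds both addresses using the Wayback Machine's public replay pattern and availability endpoint, and parses a response of the shape that endpoint returns. No network call is made; the sample response is hand-written and its timestamp is illustrative.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def replay_url(original_url: str, timestamp: str) -> str:
    """Wayback replay address: /web/<YYYYMMDDhhmmss>/<original URL>."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


def availability_query(original_url: str, timestamp: Optional[str] = None) -> str:
    """Address that asks for the capture closest to the timestamp."""
    params = {"url": original_url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


def closest_snapshot(response_json: str) -> Optional[str]:
    """Pull the closest capture's replay URL out of an availability response."""
    data = json.loads(response_json)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hand-written stand-in for an availability response.
SAMPLE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/19970101000000/http://web.mit.edu/",
            "timestamp": "19970101000000",
            "status": "200",
        }
    }
})

print(replay_url("http://web.mit.edu/", "19970101000000"))
print(availability_query("web.mit.edu", "19970101"))
print(closest_snapshot(SAMPLE))
```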

Researchers | Web Archives
Want data | Have a lot of data
Interested in change over time | Have data segmented over time
Study semantics, entities, locations | Have a rich diversity of content
Multidisciplinary | Multidisciplinary
Value info about collection origins | Chock full of provenance information

Goal of Research Services

● Expand access models for web archives + born-digital
● Enable new insights into collections
● Leverage IA (or other) infrastructure for large-scale processing to produce datasets for research
● Facilitate computational analysis and new use cases
● Increase use, visibility, and value of Archive-It partner collections


Flexible Research Services

Researchers do not necessarily need huge sets of data to do interesting work… they do need flexible data delivery services…. Different formats based on different searches for different kinds of research at different times.

V.E. Varvel Jr. & A. Thomer, Google Digital Humanities Awards recipient interviews report

Archive-It Research Services
http://bit.ly/ait_ars

Web Archive Derivative Datasets


APIs, Notebooks, Interfaces



Historical ccTLD Wayback Machines! Built on IA global crawls + added historic web data

With keyword and mime/format search, embed linkback, domain stats, and special features

Accessing: Data/Datathons

● White House Datathon
○ Worked with White House & hosted event
● Archives Unleashed - http://archivesunleashed.com/
○ AU 3.0 -- At IA as part of WASAPI symposium
● WARCshop
○ PSU workshop for archivists to support research
● Webinar on using APIs (for SAA, videos soon)
● Online workshop + notebooks
○ https://github.com/vinaygoel/ars-workshop

“The .GOV Internet Archive: A Big Data Resource for Political Science”
Gade, et al., The Political Methodologist

Colors of a (disappeared) National Web
Anat Ben-David (Digital Soci/PoliSci)

Exploring the Canadian Political Parties + Geocities
(Ian Milligan, Digital Historian)

[Diagram. Labels: web collections; APIs; disks; Custodian hardware/cloud; Comm Cloud (AWS, Goog, Azure, Wolfram, etc); Academic HPC; Local + tooling/analytics; derivs; tools; platforms; Seeds, WARCs, Derivative Datasets, Publications, Research Data]

Research Services Approaches

● Datasets to researchers, patrons, developers
● Minimize need for dedicated infrastructure
○ leverage custodian computing power and archival expertise
● Hide complexities and volume of datasets and of born-digital collections through derivative formats
● Ongoing development of platforms and APIs for research data analysis and manipulation
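As an illustration of a derivative format that hides the size and complexity of raw captures, the sketch below reduces archived pages to a small table of host-to-host link counts per year. It is a toy example under stated assumptions: the `(timestamp, url, html)` capture tuples are invented, real inputs would come from WARC files, and the choice of a yearly link graph is only one possible derivative, not a description of the Archive-It datasets.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Iterable, List, Tuple
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def host_edges(captures: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count (year, source host, target host) edges across captures."""
    edges: Counter = Counter()
    for timestamp, page_url, html in captures:
        parser = LinkExtractor()
        parser.feed(html)
        src = urlsplit(page_url).netloc
        for href in parser.hrefs:
            # Resolve relative links against the page URL before comparing hosts.
            dst = urlsplit(urljoin(page_url, href)).netloc
            if dst and dst != src:
                edges[(timestamp[:4], src, dst)] += 1
    return edges


# Hypothetical captures: (14-digit timestamp, page URL, HTML body).
captures = [
    ("20160101000000", "http://a.example/",
     '<a href="http://b.example/x">b</a> <a href="/local">local</a>'),
    ("20170101000000", "http://a.example/",
     '<a href="http://b.example/">b</a><a href="http://c.example/">c</a>'),
]

edges = host_edges(captures)
for (year, src, dst), count in sorted(edges.items()):
    print(year, src, "->", dst, count)
```

A researcher interested in change over time can work with a table like this directly, without handling the underlying captures.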

Conclusion I

● Web archives are the present and future of historical records and include all media types
● No future scholarship can study our era without considering materials published (only) on the web
● Web archives will unsettle prior methodological approaches
● But web archives will offer new potential in research, both scholarly inquiry and public recreation

Conclusion II

Advancing research use of web archives and born-digital historical records (i.e., the archives of now) will require greater comfort, by archivists and by historians, with technical mediation at multiple levels and the increasing distance between the granularity and totality of the objects and subjects of study.

Conclusion Last

● Building ‘safety nets’ for born-digital resources will depend on the adoption of new technologies, new practices, new collections, and new research services/methods.
● The results will be the ongoing primacy and utility of the archive record and the continued vitality and resiliency of historical scholarship.


THANKS!

Jefferson Bailey, Director of Web Archiving

[email protected] | @jefferson_bail

Internet Archive, https://archive.org

Archive-It, https://archive-it.org