Tufts Spatial Data Rescue: Crawling at-risk Government Data Kyle Monahan Statistics and Research Technology Specialist Tufts University FOSS4G | Boston, MA | 8/17/2017

Upload: kyle-monahan

Post on 22-Jan-2018

Page 1: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Tufts Spatial Data Rescue: Crawling at-risk Government Data

Kyle Monahan

Statistics and Research Technology Specialist

Tufts University

FOSS4G | Boston, MA | 8/17/2017

Page 2: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Background

• What is a data rescue?

• Methods and techniques to identify, store and preserve datasets

• Predominantly data associated with government entities
  • *.gov, *.mil, *.edu, *.org, etc.

• Especially critical during election transitions

12/23/2017 FOSS4G Conference | Boston, MA 2

Page 3: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Background - History

Page 4: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Example: 2008 End of Term Harvest

• The National Archives and Records Administration announced that it would be unable to rescue data as it had in 2004.

• The International Internet Preservation Consortium (IIPC) responded by organizing a crawl:
  • California Digital Library
  • Internet Archive
  • Government Printing Office
  • Library of Congress
  • University of North Texas

• Goal: “comprehensive harvest” (EOTerm Archive, 2016)

Page 5: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Example: 2008 End of Term Harvest

• Consisted of three main crawls:
  • Pre-election
  • Post-election
  • Post-inauguration

• Produced over 16 TB of data

• And 160,211,356 URIs (Phillips, 2016)

End of Term, 2008

Page 6: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Methods of Tufts Crawl

• Access to up-to-date federal data is critical for our Data Lab

• GIS and statistics classes rely on federal data (e.g. US Census, TRI, HUD)

Page 7: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Methods of Tufts Crawl

• Asked faculty and staff at Tufts which data were key to their work

• Also reached out to the Open Geoportal (OGP) community

• Created a list of critical data sources that enable research and learning at Tufts and beyond

Google Docs

OGP Outreach

Page 8: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Methods of Tufts Crawl

• Used FileZilla, an FTP client

• It can re-initiate connections after a failure

• Ran it on multiple computers overnight, each set to mirror a different FTP site
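The slides name FileZilla as the tool, so the following is not the method used at Tufts. It is a minimal sketch, using Python's standard ftplib, of the behaviour described above: retry after a dropped connection and resume a partial download rather than start over. The helper names `retry` and `fetch_with_resume` are hypothetical, as are the host and paths in the usage note.

```python
import os
import time
from ftplib import FTP, all_errors


def retry(action, attempts=5, delay=0.0):
    """Call action() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except all_errors:  # ftplib errors, OSError and EOFError
            if attempt == attempts:
                raise
            time.sleep(delay)


def fetch_with_resume(host, remote_path, local_path):
    """Download one file, continuing from the bytes already on disk."""
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    with FTP(host) as ftp, open(local_path, "ab") as out:
        ftp.login()  # anonymous login; an assumption about the server
        # rest= asks the server to start sending at the given byte offset.
        ftp.retrbinary(f"RETR {remote_path}", out.write, rest=offset or None)
```

A call might look like `retry(lambda: fetch_with_resume("ftp.example.gov", "/pub/data.zip", "data.zip"), delay=30)`. Because the local file is opened in append mode, each retry picks up where the previous attempt stopped, provided the server supports the REST command.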

Page 9: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Methods of Tufts Crawl

• Also collected data from bandwidth-limited locations by traditional mail

• Placed a 128 GB flash drive in an envelope
  • Caught in a storm, but still much faster than dial-up speed
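The dial-up comparison can be checked with a little arithmetic. This sketch assumes decimal gigabytes, a nominal 56 kbit/s modem, and a three-day postal delivery; none of these figures come from the slides.

```python
DRIVE_BYTES = 128 * 10**9    # 128 GB flash drive, decimal gigabytes (assumption)
DIALUP_BITS_PER_S = 56_000   # nominal 56k modem rate (assumption)
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, bits_per_second):
    """Days needed to move size_bytes over a link of the given speed."""
    return size_bytes * 8 / bits_per_second / SECONDS_PER_DAY


def effective_mbit_per_s(size_bytes, days):
    """Average throughput when size_bytes arrives after the given delay."""
    return size_bytes * 8 / (days * SECONDS_PER_DAY) / 1e6


dialup_days = transfer_days(DRIVE_BYTES, DIALUP_BITS_PER_S)
mail_rate = effective_mbit_per_s(DRIVE_BYTES, days=3)  # assumed delivery time

print(f"Dial-up: about {dialup_days:.0f} days")
print(f"Mail:    about {mail_rate:.2f} Mbit/s on average")
```

Under these assumptions the drive would take roughly 212 days to send over dial-up, while a three-day envelope averages just under 4 Mbit/s.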

Page 10: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Summary of Results

Figure: Data Rescue, Estimated Harvest (bar chart)

x-axis: Year of Harvest; y-axis: Data Rescued, TB

2008: 16 TB | 2012: 31 TB | 2017: 40 TB

From Tufts alone; likely much higher for all data rescues!

Source: Phillips, 2016

Page 11: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Results of Tufts Crawl

Word Cloud (highest frequency terms)

Page 12: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Development of Tufts Crawler

• High volume of data, much of it zipped
  • Some further compressed inside zip files

• Needed a lightweight tool to assess what data was captured
  • Solution: a Python script
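The Tufts Crawler source is not reproduced in this transcript, so what follows is only a minimal sketch of the problem the slide describes: taking stock of archives that contain further zip files. It uses the standard zipfile module, and the function name `inventory` is hypothetical.

```python
import io
import os
import zipfile
from collections import Counter


def inventory(zip_source, counts=None, sizes=None):
    """Tally file extensions and uncompressed bytes, descending into nested zips."""
    counts = Counter() if counts is None else counts
    sizes = Counter() if sizes is None else sizes
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            ext = os.path.splitext(info.filename)[1].lower() or "(none)"
            if ext == ".zip":
                try:
                    # Read the inner archive into memory and recurse into it.
                    inventory(io.BytesIO(archive.read(info)), counts, sizes)
                    continue
                except zipfile.BadZipFile:
                    pass  # not a real archive: count it like any other file
            counts[ext] += 1
            sizes[ext] += info.file_size
    return counts, sizes
```

Nothing is written to disk, which keeps the check lightweight, but each inner archive is held in memory while it is read. For multi-gigabyte nested archives, spooling to a temporary file would be the safer choice.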

Page 13: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Development of Tufts Crawler

• Packaged the Python script in a GUI using Tkinter
  • Object-oriented layer on Tcl/Tk

• Allows users unfamiliar with Python to use the tool

• Provides a simple interface and clear results
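As an illustration of the idea only, and not the actual Tufts Crawler interface, a Tkinter wrapper can be as small as one button and one text box. The names `format_report` and `launch` are hypothetical, and `scan` stands for any function that takes a file path and returns a `Counter` of file types.

```python
from collections import Counter


def format_report(counts):
    """Render extension counts as aligned text, most common first."""
    lines = [f"{ext:<10}{n:>8}" for ext, n in counts.most_common()]
    return "\n".join(lines) or "No files found."


def launch(scan):
    """Open a small window; scan(path) must return a Counter."""
    # Imported here so the reporting helper works without a display.
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.title("Crawler sketch")
    output = tk.Text(root, width=40, height=15)

    def choose_and_scan():
        path = filedialog.askopenfilename(filetypes=[("Zip archives", "*.zip")])
        if path:
            output.delete("1.0", tk.END)
            output.insert(tk.END, format_report(scan(path)))

    tk.Button(root, text="Choose archive...", command=choose_and_scan).pack()
    output.pack()
    root.mainloop()
```

Keeping the report formatting separate from the window code means the same logic can be run from the command line or tested without opening a GUI.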

Page 14: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Results of Tufts Crawler

Unzips files

Records type of file

Organizes XML data
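The slide does not say how the XML data is organized. One plausible reading, sketched here purely as an assumption, is to group XML files by their root element so that metadata records of the same kind end up together. The function name `group_by_root` and the sample file names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_by_root(xml_documents):
    """Map each root tag to the file names that use it.

    xml_documents is a dict of file name -> XML content (bytes or str).
    Files that fail to parse are collected under '(invalid)'.
    """
    groups = defaultdict(list)
    for name, payload in xml_documents.items():
        try:
            root_tag = ET.fromstring(payload).tag
        except ET.ParseError:
            root_tag = "(invalid)"
        groups[root_tag].append(name)
    return dict(groups)


docs = {
    "tracts.shp.xml": b"<metadata><title>Census tracts</title></metadata>",
    "sites.kml": b"<kml><Document/></kml>",
    "broken.xml": b"<metadata>",
}
groups = group_by_root(docs)
```

Here `groups` places `tracts.shp.xml` under `metadata`, `sites.kml` under `kml`, and the truncated file under `(invalid)`.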

Page 15: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Using the Tufts Crawler – Take a Look

Page 16: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Summary & Future Work

• Tufts identified federal data perceived to be “at-risk”
  • Harvested over 40 TB of data, mostly compressed

• Developed the Tufts Crawler to unpack archives and categorize file types, sizes and other metadata

• Future work: package into an .exe and add an estimated progress bar

Page 17: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Acknowledgements

• Tufts Geospatial Team: Carolyn Talmadge, Chris Barnett, Szuhui Wu, Annie Swafford, Kristen Lee, Adrian Sharpe, Patrick Florance.

• Graduate students: Sam Boiler.

• Others: Faculty and members of OGP who assisted in data selection, DataRescue Boston, the #DataRefuge Slack channel, all in Bromfield House.

Page 18: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Thank you!

Kyle M. Monahan

Statistics & Research Technology Specialist

Tufts University

For more information: kylemonahan.info, datalab.tufts.edu

Page 19: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Questions?

5 minutes

Page 20: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Extra Slides – Python Code

Page 21: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Extra Slides – Python Code

Page 22: Tufts Spatial Data Rescue: Crawling at-risk Government Data

Extra Slides – Details about tkinter GUI
