Tufts Spatial Data Rescue: Crawling At-Risk Government Data
Kyle Monahan
Statistics and Research Technology Specialist
Tufts University
FOSS4G | Boston, MA | 8/17/2017
Background
• What is a data rescue?
  • Methods and techniques to identify, store and preserve datasets
  • Predominantly data associated with government entities
    • *.gov, *.mil, *.edu, *.org, etc.
  • Especially critical during election transitions
12/23/2017 FOSS4G Conference | Boston, MA 2
Background - History
Example: 2008 End of Term Harvest
• National Archives and Records Administration announced they would be unable to rescue data as they did in 2004.
• International Internet Preservation Consortium (IIPC) responded by organizing a crawl:
  • California Digital Library
  • Internet Archive
  • Government Printing Office
  • Library of Congress
  • University of North Texas
• Goal: “comprehensive harvest” (EOTerm Archive, 2016)
Example: 2008 End of Term Harvest
• Consisted of three main crawls:
  • Pre-election
  • Post-election
  • Post-inauguration
• Produced over 16 TB of data
• And 160,211,356 URIs (Phillips, 2016)
End of Term, 2008
Methods of Tufts Crawl
• Access to up-to-date Federal data is critical for our Data Lab
• GIS and statistics classes rely on federal data (e.g. US Census, TRI, HUD)
Methods of Tufts Crawl
• Inquired about key data for faculty and staff at Tufts
• Also reached out to the Open Geoportal community
• Created a list of critical data sources that enable research and learning at Tufts and beyond
Google Docs
OGP Outreach
Methods of Tufts Crawl
• Used FileZilla, an FTP client
• Can re-initiate connections after failure
• Ran on multiple computers overnight, set to mirror different FTP sites
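The slides describe doing this step with FileZilla and show no script. Purely as an illustration of the same idea (retry a transfer after a dropped connection), here is a minimal Python sketch using the standard-library ftplib. The function names, the anonymous login, and the retry settings are all assumptions, and only a single flat directory is handled.

```python
import ftplib
import os
import time


def with_retries(operation, attempts=5, delay_seconds=30.0):
    """Call operation() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ftplib.all_errors:  # FTP errors, OSError and EOFError
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


def download_file(host, remote_dir, filename, local_dir):
    """Open a fresh connection and fetch one file (assumes anonymous FTP)."""
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, filename)
    with ftplib.FTP(host, timeout=60) as ftp:
        ftp.login()  # anonymous login; an assumption for this sketch
        ftp.cwd(remote_dir)
        # "wb" restarts the file from scratch on every retry.
        with open(local_path, "wb") as handle:
            ftp.retrbinary("RETR " + filename, handle.write)
    return local_path


def mirror_directory(host, remote_dir, local_dir):
    """Download every entry in one remote directory.

    Assumes the directory holds only files; subdirectories are not followed.
    """
    def list_names():
        with ftplib.FTP(host, timeout=60) as ftp:
            ftp.login()
            ftp.cwd(remote_dir)
            return ftp.nlst()

    names = with_retries(list_names)
    return [
        with_retries(lambda n=name: download_file(host, remote_dir, n, local_dir))
        for name in names
    ]
```

Each retry opens a new connection, which is the simplest way to recover from a timeout. A tool like FileZilla can also resume partial transfers, which this sketch does not attempt.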
Methods of Tufts Crawl
• Also collected data from speed-limited locations by traditional mail
• Placed 128 GB flash drive in an envelope
  • Caught in a storm, but still much faster than dial-up speed
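The "faster than dial-up" comparison can be checked with some quick arithmetic. The sketch below assumes a completely full 128 GB drive, a three-day postal transit, and a 56 kbit/s modem. Only the drive size comes from the slides; the transit time and modem speed are illustrative guesses.

```python
DRIVE_BYTES = 128 * 10**9        # 128 GB flash drive (decimal gigabytes)
TRANSIT_DAYS = 3                 # assumed postal transit time (hypothetical)
DIALUP_BITS_PER_SECOND = 56_000  # assumed 56 kbit/s dial-up modem

SECONDS_PER_DAY = 24 * 60 * 60


def effective_mbit_per_second(num_bytes, days):
    """Average throughput of physically shipping num_bytes in `days` days."""
    return num_bytes * 8 / (days * SECONDS_PER_DAY) / 1e6


def transfer_days(num_bytes, bits_per_second):
    """Days needed to move num_bytes over a link of the given speed."""
    return num_bytes * 8 / bits_per_second / SECONDS_PER_DAY


mail_rate = effective_mbit_per_second(DRIVE_BYTES, TRANSIT_DAYS)
dialup_days = transfer_days(DRIVE_BYTES, DIALUP_BITS_PER_SECOND)

print(f"Mailed drive: about {mail_rate:.2f} Mbit/s")  # about 3.95 Mbit/s
print(f"Dial-up: about {dialup_days:.0f} days")        # about 212 days
```

Under these assumptions the envelope averages roughly 70 times the dial-up rate, so even a storm-delayed delivery comes out well ahead.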
Summary of Results
[Bar chart: Data Rescue, Estimated Harvest. x-axis: Year of Harvest; y-axis: Data Rescued, TB. Values: 2008: 16 TB; 2012: 31 TB; 2017: 40 TB.]
From Tufts alone – likely much higher for all data rescues!
Source: Phillips, 2016
Results of Tufts Crawl
Word Cloud (highest frequency terms)
Development of Tufts Crawler
• High volume of data – much of it zipped
  • Some further compressed inside zip files
• Needed a lightweight tool to assess what data was captured
  • Solution: Python script
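The script itself is not reproduced in this transcript. As a rough sketch of how zips nested inside zips could be inventoried with the standard-library zipfile module, consider the following. The function name and the choice to count files by extension are assumptions, not the Tufts Crawler's actual design.

```python
import io
import os
import zipfile
from collections import Counter


def inventory_zip(source, counts=None):
    """Count files by extension inside a zip, descending into nested zips.

    `source` may be a file path or a binary file-like object.
    """
    if counts is None:
        counts = Counter()
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            extension = os.path.splitext(info.filename)[1].lower() or "(none)"
            if extension != ".zip":
                counts[extension] += 1
                continue
            # Nested archive: read it into memory and recurse.
            # Very large inner zips would need a temporary file instead.
            inner = io.BytesIO(archive.read(info))
            try:
                inventory_zip(inner, counts)
            except zipfile.BadZipFile:
                counts[".zip (unreadable)"] += 1
    return counts
```

Calling `inventory_zip("some_archive.zip")` (a placeholder path) returns a Counter such as `{".shp": 12, ".xml": 3}` without extracting anything to disk.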
Development of Tufts Crawler
• Packaged the Python script in a GUI using Tkinter
  • Object-oriented layer on Tcl/Tk
• Allows users unfamiliar with Python to use the tool
• Provides a simple interface and clear results
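To give a sense of what wrapping a script in Tkinter can look like, here is a minimal sketch: a folder picker plus a text box showing file counts by extension. The layout, widget choices, and function names are hypothetical and are not taken from the Tufts Crawler.

```python
import os
from collections import Counter


def count_extensions(folder):
    """Walk a folder and count files by extension (the non-GUI logic)."""
    counts = Counter()
    for _root, _dirs, files in os.walk(folder):
        for name in files:
            counts[os.path.splitext(name)[1].lower() or "(none)"] += 1
    return counts


def format_report(counts):
    """Turn the counts into sorted lines of text for display."""
    return "\n".join(f"{ext}: {n}" for ext, n in sorted(counts.items()))


def launch_gui():
    """Open a small window with a folder picker and a results box."""
    # Imported here so the functions above still work without a display.
    import tkinter as tk
    from tkinter import filedialog, scrolledtext

    root = tk.Tk()
    root.title("File type summary (sketch)")
    output = scrolledtext.ScrolledText(root, width=40, height=15)

    def choose_folder():
        folder = filedialog.askdirectory()
        if folder:  # an empty result means the dialog was cancelled
            output.delete("1.0", tk.END)
            output.insert(tk.END, format_report(count_extensions(folder)))

    tk.Button(root, text="Choose folder...", command=choose_folder).pack(pady=5)
    output.pack(padx=5, pady=5)
    root.mainloop()
```

Calling `launch_gui()` opens the window. Keeping the counting logic in plain functions means it can be tested or reused from the command line without any GUI.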
Results of Tufts Crawler
Unzips files
Records type of file
Organizes XML data
Using the Tufts Crawler – Take a Look
Summary & Future Work
• Tufts identified federal data perceived as “at-risk”
  • Harvested over 40 TB of data, mostly compressed
• Developed Tufts Crawler to unpack and categorize types, sizes and other metadata
• Future work: package into an .exe, add an estimated-progress bar
Acknowledgements
• Tufts Geospatial Team: Carolyn Talmadge, Chris Barnett, Szuhui Wu, Annie Swafford, Kristen Lee, Adrian Sharpe, Patrick Florance.
• Graduate students: Sam Boiler.
• Others: Faculty and members of OGP who assisted in data selection, DataRescue Boston, the #DataRefuge Slack channel, all in Bromfield House.
Thank you!
Kyle M. Monahan
Statistics & Research Technology Specialist
Tufts University
For more information:
kylemonahan.info
datalab.tufts.edu
Questions?
5 minutes
Extra Slides – Python Code
Extra Slides – Python Code
Extra Slides – Details about tkinter GUI