![Page 1: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/1.jpg)
Tufts Spatial Data Rescue: Crawling at-risk Government Data
Kyle Monahan
Statistics and Research Technology Specialist
Tufts University
FOSS4G | Boston, MA | 8/17/2017
![Page 2: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/2.jpg)
Background
• What is a data rescue?
• Methods and techniques to identify, store and preserve datasets
• Predominantly data associated with government entities
  • *.gov, *.mil, *.edu, *.org, etc.
• Especially critical during election transitions
12/23/2017 FOSS4G Conference | Boston, MA 2
![Page 3: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/3.jpg)
Background - History
![Page 4: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/4.jpg)
Example: 2008 End of Term Harvest
• National Archives and Records Administration announced they would be unable to rescue data as they did in 2004.
• International Internet Preservation Consortium (IIPC) responded by organizing a crawl:
  • California Digital Library
  • Internet Archive
  • Government Printing Office
  • Library of Congress
  • University of North Texas
• Goal: “comprehensive harvest” (EOTerm Archive, 2016)
![Page 5: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/5.jpg)
Example: 2008 End of Term Harvest
• Consisted of three main crawls:
  • Pre-election
  • Post-election
  • Post-inauguration
• Produced over 16 TB of data
• And 160,211,356 URIs (Phillips, 2016)
End of Term, 2008
![Page 6: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/6.jpg)
Methods of Tufts Crawl
• Access to up-to-date federal data is critical for our Data Lab
• GIS and statistics classes rely on federal data (e.g. US Census, TRI, HUD)
![Page 7: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/7.jpg)
Methods of Tufts Crawl
• Asked Tufts faculty and staff which data are key to their work
• Also reached out to the Open Geoportal (OGP) community
• Created a list of critical data sources that enable research and learning at Tufts and beyond
Google Docs
OGP Outreach
![Page 8: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/8.jpg)
Methods of Tufts Crawl
• Used FileZilla, an FTP client
• It can re-initiate connections after a failure
• Ran it on multiple computers overnight, each set to mirror a different FTP site
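The crawl itself used FileZilla's graphical client. For readers who prefer a script, the same idea (fetch files and reconnect after a dropped connection) might be sketched with Python's standard `ftplib`. This is only an illustration under assumptions: the helper names (`with_retries`, `download_file`, `mirror_files`), the anonymous login, and the retry counts are hypothetical and are not taken from the Tufts workflow.

```python
import ftplib
import os
import time


def with_retries(action, attempts=5, delay_seconds=0.0):
    """Call action() until it succeeds or attempts (>= 1) run out.

    Hypothetical helper: only network-style errors are retried.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except ftplib.all_errors as error:  # FTP errors, OSError, EOFError
            last_error = error
            time.sleep(delay_seconds)
    raise last_error


def download_file(host, remote_path, local_path):
    """Open a fresh FTP connection and fetch one file."""
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    with ftplib.FTP(host, timeout=60) as ftp:
        ftp.login()  # anonymous login; an assumption about the server
        with open(local_path, "wb") as handle:
            ftp.retrbinary("RETR " + remote_path, handle.write)
    return local_path


def mirror_files(host, remote_paths, local_root):
    """Fetch each listed file, reconnecting after a failed attempt."""
    saved = []
    for remote_path in remote_paths:
        local_path = os.path.join(local_root, remote_path.lstrip("/"))
        saved.append(with_retries(
            lambda: download_file(host, remote_path, local_path),
            attempts=5, delay_seconds=30))
    return saved
```

Because every attempt opens a new connection, a dropped session costs one file at most. Running one copy per machine, each with its own host, would approximate the overnight setup described above.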
![Page 9: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/9.jpg)
Methods of Tufts Crawl
• Also collected data from speed-limited locations by traditional mail
• Placed a 128 GB flash drive in an envelope
  • Caught in a storm, but still much faster than dial-up speed
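The "faster than dial-up" remark can be checked with quick arithmetic. The sketch below assumes a nominal dial-up rate of 56 kbit/s and decimal gigabytes. Neither figure comes from the slides, and the three-day mailing time in the usage note is likewise a made-up example.

```python
DRIVE_BYTES = 128 * 10**9          # 128 GB flash drive, decimal gigabytes
DIALUP_BITS_PER_SECOND = 56_000    # assumed nominal dial-up rate, not from the slides
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, bits_per_second):
    """Days needed to move size_bytes over a link of the given speed."""
    return size_bytes * 8 / bits_per_second / SECONDS_PER_DAY


def effective_megabits_per_second(size_bytes, days_in_transit):
    """Average throughput of a drive that spends days_in_transit in the mail."""
    return size_bytes * 8 / (days_in_transit * SECONDS_PER_DAY) / 1e6


dialup_days = transfer_days(DRIVE_BYTES, DIALUP_BITS_PER_SECOND)
```

Under these assumptions a full drive would need about 212 days over dial-up, while a hypothetical three-day trip by mail, `effective_megabits_per_second(DRIVE_BYTES, 3)`, averages roughly 4 Mbit/s.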
![Page 10: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/10.jpg)
Summary of Results
[Figure: bar chart, “Data Rescue, Estimated Harvest”. x-axis: Year of Harvest; y-axis: Data Rescued, TB. Bars: 2008: 16 TB; 2012: 31 TB; 2017: 40 TB.]
From Tufts alone – likely much higher for all data rescues!
Source: Phillips, 2016
![Page 11: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/11.jpg)
Results of Tufts Crawl
Word Cloud (highest frequency terms)
![Page 12: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/12.jpg)
Development of Tufts Crawler
• High volume of data – much of it zipped
  • Some further compressed inside zip files
• Needed a lightweight tool to assess what data was captured
  • Solution: Python script
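The slides do not show the script itself. As a rough sketch of how zips nested inside zips could be inventoried without extracting them to disk, the standard `zipfile` module is enough. The function name `list_members` and the `!/` path separator are illustrative choices, not the Tufts Crawler's.

```python
import io
import zipfile


def list_members(source, prefix=""):
    """Yield (path, size_in_bytes) for every file, descending into nested zips.

    source may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            if info.filename.lower().endswith(".zip"):
                # Read the inner archive into memory and recurse into it.
                inner = io.BytesIO(archive.read(info))
                if zipfile.is_zipfile(inner):
                    yield from list_members(inner, prefix=path + "!/")
                    continue
            # Ordinary file, or a ".zip" that is not a valid archive.
            yield path, info.file_size
```

Reading each inner archive into memory keeps the sketch short. For multi-gigabyte inner zips, extracting to a temporary file first would be the safer variant.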
![Page 13: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/13.jpg)
Development of Tufts Crawler
• Packaged the Python script in a GUI using Tkinter
  • Object-oriented layer on Tcl/Tk
• Allows users unfamiliar with Python to use the tool
• Provides a simple interface and clear results
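To give a feel for what wrapping a script in Tkinter involves, here is a minimal sketch: one button opens a folder picker and a text box lists how many files of each extension were found. The layout, widget choices, and function names (`count_extensions`, `build_window`) are assumptions for illustration. The actual Tufts Crawler interface may look quite different.

```python
import os
from collections import Counter


def count_extensions(folder):
    """Count files under folder by lowercase extension."""
    counts = Counter()
    for _root, _dirs, files in os.walk(folder):
        for name in files:
            counts[os.path.splitext(name)[1].lower() or "(none)"] += 1
    return counts


def build_window():
    """Create a small window: pick a folder, show extension counts."""
    # Imported here so count_extensions stays usable without a display.
    import tkinter as tk
    from tkinter import filedialog

    window = tk.Tk()
    window.title("Crawl summary (sketch)")
    results = tk.Text(window, width=40, height=15)

    def choose_folder():
        folder = filedialog.askdirectory()
        if not folder:
            return  # dialog was cancelled
        results.delete("1.0", tk.END)
        for extension, count in count_extensions(folder).most_common():
            results.insert(tk.END, f"{extension}\t{count}\n")

    tk.Button(window, text="Choose folder...", command=choose_folder).pack()
    results.pack()
    return window
```

Calling `build_window().mainloop()` starts the interface. Keeping the counting logic in a plain function means it can also be run and tested without any GUI.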
![Page 14: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/14.jpg)
Results of Tufts Crawler
Unzips files
Records type of file
Organizes XML data
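The three steps named on this slide could be sketched roughly as follows, using only the standard library. Grouping XML files by their root element is an assumption about what "organizes" means here, and `unzip_and_index` is a hypothetical name rather than a function from the Tufts Crawler.

```python
import os
import xml.etree.ElementTree as ET
import zipfile
from collections import defaultdict


def unzip_and_index(zip_path, output_dir):
    """Extract zip_path into output_dir, then return (types, xml_by_root).

    types maps each file extension to the extracted paths that have it;
    xml_by_root maps each XML root tag to the files that use it.
    """
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(output_dir)

    types = defaultdict(list)
    xml_by_root = defaultdict(list)
    for root, _dirs, files in os.walk(output_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            extension = os.path.splitext(name)[1].lower()
            types[extension].append(path)
            if extension == ".xml":
                try:
                    tag = ET.parse(path).getroot().tag
                except ET.ParseError:
                    tag = "(unparseable)"  # keep damaged files visible
                xml_by_root[tag].append(path)
    return dict(types), dict(xml_by_root)
```

Grouping by root tag is one cheap way to separate different kinds of XML files before any deeper parsing. Files that fail to parse are kept under their own key so they are not silently lost.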
![Page 15: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/15.jpg)
Using the Tufts Crawler – Take a Look
![Page 16: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/16.jpg)
Summary & Future Work
• Tufts identified federal data perceived as “at-risk”
  • Harvested over 40 TB of data, mostly compressed
• Developed Tufts Crawler to unpack the harvest and categorize file types, sizes and other metadata
• Future work: package as an .exe and add an estimated progress bar
![Page 17: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/17.jpg)
Acknowledgements
• Tufts Geospatial Team: Carolyn Talmadge, Chris Barnett, Szuhui Wu, Annie Swafford, Kristen Lee, Adrian Sharpe, Patrick Florance.
• Graduate students: Sam Boiler.
• Others: Faculty and members of OGP who assisted in data selection, DataRescue Boston, the #DataRefuge Slack channel, all in Bromfield House.
![Page 18: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/18.jpg)
Thank you!
Kyle M. Monahan
Statistics & Research Technology Specialist
Tufts University
For more information: kylemonahan.info | datalab.tufts.edu
![Page 19: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/19.jpg)
Questions?
5 minutes
![Page 20: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/20.jpg)
Extra Slides – Python Code
![Page 21: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/21.jpg)
Extra Slides – Python Code
![Page 22: Tufts Spatial Data Rescue: Crawling at-risk Government Data](https://reader031.vdocuments.site/reader031/viewer/2022030317/5a65a84f7f8b9ac2368b4995/html5/thumbnails/22.jpg)
Extra Slides – Details about tkinter GUI