tufts spatial data rescue: crawling at-risk government data

Post on 22-Jan-2018


Tufts Spatial Data Rescue: Crawling At-Risk Government Data

Kyle Monahan

Statistics and Research Technology Specialist, Tufts University

FOSS4G | Boston, MA | 8/17/2017

Background

• What is a data rescue?

• Methods and techniques to identify, store and preserve datasets

• Predominantly data associated with government entities

  • *.gov, *.mil, *.edu, *.org, etc.

• Especially critical during election transitions

12/23/2017 FOSS4G Conference | Boston, MA 2

Background - History

Example: 2008 End of Term Harvest

• National Archives and Records Administration announced they would be unable to rescue data as they did in 2004.

• International Internet Preservation Consortium (IIPC) responded by organizing a crawl:

  • California Digital Library

  • Internet Archive

  • Government Printing Office

  • Library of Congress

  • University of North Texas

• Goal: “comprehensive harvest” (EOTerm Archive, 2016)

Example: 2008 End of Term Harvest

• Consisted of three main crawls:

  • Pre-election

  • Post-election

  • Post-inauguration

• Produced over 16 TB of data

• And 160,211,356 URIs (Phillips, 2016)

End of Term, 2008

Methods of Tufts Crawl

• Access to up-to-date federal data is critical for our Data Lab

• GIS and statistics classes rely on federal data (e.g. US Census, TRI, HUD)

Methods of Tufts Crawl

• Asked faculty and staff at Tufts which data are key to their work

• Also reached out to the Open Geoportal (OGP) community

• Created a list of critical data sources that enable research and learning at Tufts and beyond

Google Docs

OGP Outreach

Methods of Tufts Crawl

• Used an FTP client called FileZilla

• Can re-initiate connections after failure

• Ran on multiple computers overnight, each set to mirror a different FTP site
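The crawl itself was done with FileZilla, not with code. For readers who would rather script the same idea, the following is a minimal sketch of "reconnect after failure" using Python's standard ftplib. The helper names (with_retries, fetch_file) and the retry counts are assumptions made for illustration and do not come from the talk.

```python
import ftplib
import time
from pathlib import Path


def with_retries(action, attempts=3, delay=0.0):
    """Call action() until it succeeds or `attempts` (>= 1) tries are used up.

    Only network/FTP errors are retried; anything else propagates at once.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except ftplib.all_errors as exc:  # ftplib.Error, OSError, EOFError
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # wait before opening a fresh connection
    raise last_error


def fetch_file(host, remote_path, local_dir, user="anonymous", passwd=""):
    """Download one file, opening a brand-new connection on every try."""
    target = Path(local_dir) / Path(remote_path).name
    target.parent.mkdir(parents=True, exist_ok=True)

    def attempt():
        with ftplib.FTP(host, timeout=60) as ftp:
            ftp.login(user, passwd)
            with open(target, "wb") as handle:
                ftp.retrbinary("RETR " + remote_path, handle.write)
        return target

    # Hypothetical settings: five tries, ten seconds apart.
    return with_retries(attempt, attempts=5, delay=10)
```

Each retry restarts the file from the beginning, which is simpler than resuming a partial transfer but wasteful for very large files.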

Methods of Tufts Crawl

• Also completed collection of data from speed-limited locations by traditional mail

• Placed a 128 GB flash drive in an envelope

  • Caught in a storm, but still much faster than dial-up speed
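The "faster than dial-up" comparison can be checked with rough arithmetic. The sketch below assumes a 56 kbit/s dial-up line, a three-day postal transit, and decimal gigabytes; none of these figures are from the talk, so treat the result as an order-of-magnitude estimate only.

```python
# All three constants below are assumptions for illustration, not from the talk.
DRIVE_BYTES = 128 * 10**9      # 128 GB flash drive, decimal gigabytes
DIALUP_BITS_PER_S = 56_000     # nominal 56k modem, best case
MAIL_TRANSIT_DAYS = 3          # hypothetical time in the post

SECONDS_PER_DAY = 86_400


def dialup_days(n_bytes, bits_per_s=DIALUP_BITS_PER_S):
    """Days needed to send n_bytes over a line of the given speed."""
    return n_bytes * 8 / bits_per_s / SECONDS_PER_DAY


def mail_throughput_mbit_s(n_bytes, transit_days=MAIL_TRANSIT_DAYS):
    """Effective Mbit/s of a drive that spends transit_days in the mail."""
    return n_bytes * 8 / (transit_days * SECONDS_PER_DAY) / 1e6


print(f"Dial-up: about {dialup_days(DRIVE_BYTES):.0f} days")
print(f"Mail: about {mail_throughput_mbit_s(DRIVE_BYTES):.2f} Mbit/s effective")
```

Under these assumptions the envelope delivers roughly 4 Mbit/s, while the same drive would take about 212 days to send over dial-up.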

Summary of Results

Figure: Data Rescue, Estimated Harvest (bar chart; x-axis: Year of Harvest, y-axis: Data Rescued, TB)

• 2008: 16 TB

• 2012: 31 TB

• 2017: 40 TB

From Tufts alone – likely much higher for all data rescues!

Source: Phillips, 2016

Results of Tufts Crawl

Word Cloud (highest frequency terms)

Development of Tufts Crawler

• High volume of data – much of it zipped

  • Some further compressed inside zip files

• Needed a lightweight tool to assess what data was captured

  • Solution: Python script
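The script itself is not reproduced in this transcript. As a rough illustration of the zip-inside-zip problem described above, the following sketch lists the contents of an archive and descends into nested archives using the standard zipfile module. The function name and the "!/" path separator are assumptions for illustration, not the Tufts Crawler's actual code.

```python
import io
import zipfile


def list_zip_members(source, prefix=""):
    """Yield (path, uncompressed_size) for each file in a zip archive.

    `source` is a filename or a binary file object. Members ending in
    ".zip" are opened in memory and listed recursively, so nothing is
    written to disk.
    """
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            if info.filename.lower().endswith(".zip"):
                # Read the nested archive into memory and recurse into it.
                nested = io.BytesIO(archive.read(info))
                yield from list_zip_members(nested, prefix=path + "!/")
            else:
                yield path, info.file_size
```

Reading each nested archive fully into memory keeps the sketch short, but for multi-gigabyte inner archives extracting to a temporary directory would probably be safer.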

Development of Tufts Crawler

• Packaged the Python script in a GUI using Tkinter

  • Object-oriented layer on Tcl/Tk

• Allows users unfamiliar with Python to use the tool

• Provides a simple interface and clear results
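The GUI code is not included in this transcript, so the following is only a minimal sketch of how a script can be wrapped in a Tkinter window. The window layout, the count_extensions helper, and all widget text are hypothetical and are not the Tufts Crawler's actual interface.

```python
from collections import Counter
from pathlib import Path


def count_extensions(folder):
    """Count files under `folder` by lowercase extension."""
    return Counter(
        path.suffix.lower() or "(none)"
        for path in Path(folder).rglob("*")
        if path.is_file()
    )


def build_app():
    """Build and return the main window (not yet running)."""
    # Imported here so count_extensions still works where Tk is unavailable.
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.title("Crawl summary (sketch)")
    output = tk.Text(root, width=40, height=15)

    def choose_folder():
        folder = filedialog.askdirectory()
        if not folder:  # user cancelled the dialog
            return
        output.delete("1.0", tk.END)
        for ext, count in count_extensions(folder).most_common():
            output.insert(tk.END, f"{ext}\t{count}\n")

    button = tk.Button(root, text="Choose folder...", command=choose_folder)
    button.pack(padx=10, pady=10)
    output.pack(padx=10, pady=(0, 10))
    return root
```

To launch the window, call build_app().mainloop(). Keeping the counting logic in a plain function means it can be reused or tested without opening a window.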

Results of Tufts Crawler

Unzips files

Records type of file

Organizes XML data
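The slide does not say how the XML data is organized. One plausible approach, shown here purely as an assumption, is to group XML files by their root element so that different metadata schemas end up in separate buckets. The function names and the "(unparseable)" label are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def root_tag(xml_path):
    """Return the root element's tag, or None if no root can be read."""
    try:
        with open(xml_path, "rb") as handle:
            # The first "start" event is always the root element, so the
            # rest of the file never has to be parsed.
            for _event, element in ET.iterparse(handle, events=("start",)):
                return element.tag
    except ET.ParseError:
        return None
    return None


def group_xml_by_root(folder):
    """Map each root tag to the names of the .xml files that use it."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*.xml")):
        groups[root_tag(path) or "(unparseable)"].append(path.name)
    return dict(groups)
```

Namespaced documents report their root as "{namespace}tag", which in practice helps tell metadata standards apart.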

Using the Tufts Crawler – Take a Look

Summary & Future Work

• Tufts identified federal data perceived as “at-risk”

  • Harvested over 40 TB of data, mostly compressed

• Developed Tufts Crawler to unpack and categorize types, sizes and other metadata.

• Future work: package as a standalone .exe and add a progress bar that estimates completion

Acknowledgements

• Tufts Geospatial Team: Carolyn Talmadge, Chris Barnett, Szuhui Wu, Annie Swafford, Kristen Lee, Adrian Sharpe, Patrick Florance.

• Graduate students: Sam Boiler.

• Others: Faculty and members of OGP who assisted in data selection, DataRescue Boston, the #DataRefuge Slack channel, all in Bromfield House.

Thank you!

Kyle M. Monahan

Statistics & Research Technology Specialist

Tufts University

kyle.monahan@tufts.edu

For more information: kylemonahan.info, datalab.tufts.edu

Questions?

5 minutes

Extra Slides – Python Code

Extra Slides – Details about tkinter GUI
