cleaning validating and enhancing data with open refine idigbio – university of florida 2 nd...

36
Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo and iDigBio August 12 - 16, 2013

Upload: eunice-lyons

Post on 26-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

CleaningValidating andEnhancing Data with Open Refine

iDigBio – University of Florida2nd Train-the-Trainers Georeferencing Workshop

Deborah Paul, iDigInfo and iDigBioAugust 12 - 16, 2013

Page 2: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Pre & Post-Digitization

• Exposing Data to Outside Curation – Yipee! Feedback• Data Discovery• dupes, grey literature, more complete records,

annotations of many kinds, georeferenced records• Filtered PUSH Project• Scatter, Gather, Reconcile – Specify• iDigBio

• Planning for Ingestion of Feedback – Policy Decisions• re-determinations & the annotation dilemma• to re-image or not to re-image• “annotated after imaged”• to attach a physical annotation label to the specimen

from a digital annotation or not

Page 3: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Data curation, management, and enhancement• querying dataset to find & fix errors• kinds of errors• filename errors• typos• georeferencing errors• taxonomic errors• identifier and guid errors• format errors (dates)• mapping

• enhancing data• repetitive data cleaning, enhancement using scripts

Page 4: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Clean & Enhance Data with Tools

• Query / Report / Update features of Databases• Learn how to query your databases effectively• Learn SQL (MySQL, it’s not hard – really!)

• Gives you a communication tool with your IT staff• Using new tools• Kepler Kurator – Data Cleaning and Enhancement• Open Refine, desktop app

• from messy to marvelous• http://code.google.com/p/google-refine/• http://openrefine.org/• remove leading / trailing white spaces• standardize values• call services for more data• just what is a “service” anyway?

• the magic of undo• using scripts for repetitive data manipulation tasks

• Google Fusion Tables

Page 5: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

OpenRefine

• A power tool for working with messy data.• Got Data in a Spreadsheet or other format…• TSV, CSV, *SV, Excel (.xls and .xlsx),• JSON,• XML,• RDF as XML,• Wiki markup, and • Google Data documents are all supported.

• the software tool formerly known as GoogleRefine

Page 6: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

http://openrefine.org/

• Install

Page 7: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo
Page 8: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Configure parsing optionsthen click Create Project

Page 9: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Facet, text facet

Page 10: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo
Page 11: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo
Page 12: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo
Page 13: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Enhance Data

• Call “web services”• GEOLocate example• your data has locality, county, state, country fields• limit data to a given state, county• build query• "http://www.museum.tulane.edu/webservices/

geolocatesvcv2/glcwrap.aspx?Country=USA&state=ID&county=Idaho&fmt=json&Locality="+escape(value,'url')• service returns json output• latitude, longitude values now in your dataset.

• Google Fusion tables

Page 14: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Add column by fetching URLs…

Page 15: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo
Page 16: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Data back from GEOLocate in JSON format

Page 17: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Parsing json

• How do we get our longitude and latitude out of the json?• Parsing (it’s not hard – don’t panic)!

Page 18: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Parsing json• Copy and paste the text below into • http://jsonformatter.curiousconcept.com/

• { "engineVersion" : "GLC:4.92|U:1.01374|eng:1.0", "numResults" : 2, "executionTimems" : 187.2048, "resultSet" : { "type": "FeatureCollection", "features": [ { "type": "Feature", "geometry": {"type": "Point", "coordinates": [-115.11639, 46.33778]}, "properties": { "parsePattern" : "EAGLE MOUNTAIN", "precision" : "Low", "score" : 29, "uncertaintyRadiusMeters" : 1807, "uncertaintyPolygon" : "Unavailable", "displacedDistanceMiles" : 0, "displacedHeadingDegrees" : 0, "debug" : ":GazPartMatch=False|:inAdm=True|:Adm=IDAHO|:NPExtent=3000|:NP=EAGLE MOUNTAIN|:KFID=|EAGLE MOUNTAIN" } }, { "type": "Feature", "geometry": {"type": "Point", "coordinates": [-115.89, 46.0225]}, "properties": { "parsePattern" : "CLEARWATER", "precision" : "Low", "score" : 25, "uncertaintyRadiusMeters" : 3036, "uncertaintyPolygon" : "Unavailable", "displacedDistanceMiles" : 0, "displacedHeadingDegrees" : 0, "debug" : ":GazPartMatch=False|:inAdm=True|:Adm=IDAHO|:NPExtent=5040|:NP=CLEARWATER|:KFID=|CLEARWATER" } } ], "crs": { "type" : "EPSG", "properties" : { "code" : 4326 }} } }

Page 19: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

http://jsonformatter.curiousconcept.com/

Copy json output in the spreadsheet, paste it here.Click on process button (lower right of this screen).

Page 20: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Parsing json

Page 21: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Parsing latitude

Page 22: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Parsing longitude

Page 23: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

The Results!

Page 24: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Reconciliation, an example

•What if you’d like to compare your taxon names (or author names, perhaps) to those from a standardized source?• Try reconciliation using Refine to call some

taxonomic name resolution service.•What follows is only an example. There are

other services against which to compare and enhance your data.

Page 25: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

• The Name Validation Example that follows was set up by Gaurav Vaidya (University of Colorado). Click link to read more.• Next, see screen shots for steps.• All links subject to changing over time.

Page 26: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

1st, duplicate and rename the scientificName column

Page 27: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

2nd add a Reconciliation Service

• from drop-down of reconcileScientificName click Start reconciling• in next pop-up click Add

Standard Service

• in pop-up enter url: http://refine.taxonomics.org/gbifchecklists/reconcile

• click Add Service

Page 28: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo
Page 29: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

• cells with Create new topic found no match

• click to accept this value for this cell only.

• click the to accept and apply this value to all cells that had the original value

• choosing the 1st option with creates a cell with value of Abrus precatorius L. (Plantae)

• to undo, click Choose new match

Page 30: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Refine data using and re-using scripts

• Refine records each step done and creates a reusable script.

• to undo step 5 (reconcile), just click on 5.

• to undo create new column (step 4) click on step 4.

• Extract all steps as a script, to reuse with the next dataset that has identical issues and is formatted the same way.

• Save script as a .txt file• Paste script in new dataset* and click

Apply• dataset must be like this one

(same original headers and column order)

Page 31: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

click Extractcopy and save output as a .txt file

Page 32: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

In a new dataset, easily repeat the same tasks

• got a dataset just like the one before?• different values, same columns?

• open the dataset in Refine• click on Undo / Redo• click Apply…• in the Apply Operation History• paste the same script in here

• click Perform Operations• all the steps run

• Voila, you are done and the data is refined!

Page 33: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Recipes for enhancing your data!

• Someone’s already written the same script to do the same task.•What scripts have others written?• You can use them, you don’t have to write

your own.•Where are they?• See https://github.com/OpenRefine/OpenRefine/wiki/Recipes

Page 34: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

How to get started?• This powerpoint and accompanying CSV• Open Refine videos and tutorials and recipes• see http://openrefine.org

• Join Google+ Open Refine Community• Read about Google Fusion Tables

• Please teach others about these power tools• Pay-it-forward• Get data that is “fit-for-research-use”• and have fun with your data

Page 35: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Thanks for trying out this new tool!

iDigBio is funded by a grant from the National Science Foundation's Advancing Digitization of Biodiversity Collections Program (#EF1115210). Views and opinions expressed are those of the author not necessarily those of the NSF.

This work would not be possible without prior input, assistance, contributions, and kindness from many including: Nelson Rios, Gaurav Vaidya, Guillaume Jimenez, Liath Appleton, Katja Seltmann, Laura Russell, Greg Riccardi, Stephen Flager, and Kevin Love. To each of you who contributed to demistifying data cleaning and enhancement, who provided new how-to’s, and stepped in new with explanations and examples, thank you!Deborah Paul, iDigBio

Page 36: Cleaning Validating and Enhancing Data with Open Refine iDigBio – University of Florida 2 nd Train-the-Trainers Georeferencing Workshop Deborah Paul, iDigInfo

Have fun with the data, no matter where you find it!

END