Data and Donuts: Data cleaning with OpenRefine
![Page 1: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/1.jpg)
Data Cleaning using OpenRefine
C. Tobin Magle, PhD
Nov. 9, 2016
10:00-11:00 a.m.
Morgan Library Computer Classroom 175
*inspired by content from Data Carpentry
![Page 2: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/2.jpg)
The research cycle
[Diagram labels: Hypothesis, Experimental design, Data Management Plans, Raw data, Cleaning, Tidy Data, Analysis, Results, Article]
![Page 3: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/3.jpg)
Tidy Data
1. Columns as variables
• Don’t combine multiple pieces of info in one column
2. Rows as observations
• One measured value
![Page 4: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/4.jpg)
Demo: Clean Survey data
• Find a partner
• Download the data: http://tinyurl.com/zlfoat6
• Open up the data in a spreadsheet program
• Look at 2013 and 2014 tabs: create a new tab and reformat the data into one tidy spreadsheet.
• What columns do we need?
![Page 5: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/5.jpg)
OpenRefine
• Doesn’t modify original
• Tracks changes you made
• Easily reversible
• Complex clustering algorithms
![Page 6: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/6.jpg)
Survey data
• Rows: observations of individual animals
• Columns: Variables that describe the animals
• Species, sex, date, location, etc
• Messy data
• Misspellings
• White space
• Combined variables
![Page 7: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/7.jpg)
Create a project
• Download the file: http://tinyurl.com/qjjqlby
![Page 8: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/8.jpg)
Preview
![Page 9: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/9.jpg)
Removing Whitespace
• Click the blue triangle to the left of the column header
• Edit cells
• Common transforms
• Remove leading and trailing whitespace
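For comparison, the same transform can be sketched outside OpenRefine. This is a minimal Python illustration, not OpenRefine's own code; the cell values are hypothetical and differ only in their surrounding whitespace.

```python
# Hypothetical cell values: the same name with stray spaces around it.
cells = ["  Dipodomys merriami", "Dipodomys merriami ", "Dipodomys merriami"]


def trim_cells(values):
    """Strip leading and trailing whitespace from every cell."""
    return [value.strip() for value in values]


cleaned = trim_cells(cells)
# After trimming, the three spellings collapse into a single distinct value.
print(sorted(set(cleaned)))
```

Whitespace differences are invisible in the grid but still count as different values, which is why this step usually comes before faceting.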
![Page 10: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/10.jpg)
Faceting
• ScientificName column
• Click down arrow: select text facets
• Look at possible values of the column on the left
• Edit the facets
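Conceptually, a text facet is a count of each distinct value in a column. A rough Python sketch, assuming hypothetical values for the ScientificName column (the misspelling and trailing space are made up for illustration):

```python
from collections import Counter


def text_facet(values):
    """Return each distinct value with its row count, most frequent first."""
    return Counter(values).most_common()


# Hypothetical column contents with one trailing space and one misspelling.
names = [
    "Dipodomys merriami",
    "Dipodomys merriami",
    "Dipodomys merriami ",
    "Dipodmys merriami",
]

for value, count in text_facet(names):
    # repr() makes stray whitespace visible in the printed value.
    print(repr(value), count)
```

Rare values next to a very common one are the usual candidates for editing.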
![Page 11: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/11.jpg)
Clustering
![Page 12: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/12.jpg)
Select Clustering Algorithm
![Page 13: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/13.jpg)
Merge and re-cluster
![Page 14: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/14.jpg)
Repeat Merge and Re-cluster
Until there are no more clusters…
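One family of clustering methods groups values that reduce to the same normalized key. The sketch below is a simplified Python version of that idea, not OpenRefine's implementation; the function names and sample values are assumptions made for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Simplified key: trim, lowercase, drop accents and punctuation, sort unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group distinct values that share a key; keep only groups with 2+ members."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical values: three variants of one name plus an unrelated name.
values = [
    "Dipodomys merriami",
    "dipodomys merriami",
    "Merriami, Dipodomys",
    "Onychomys torridus",
]
print(cluster(values))
```

A key-based approach like this catches differences in case, punctuation, and word order, but not misspellings, which is one reason to try more than one algorithm and re-cluster after each merge.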
![Page 15: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/15.jpg)
Split
• Edit Column > Split
• Put space as separator
• Result: new columns
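The split step can be pictured as follows. This is a minimal Python sketch; the row contents and the `_1`, `_2` naming of the new columns are assumptions for illustration and may not match what OpenRefine produces.

```python
def split_column(rows, column, separator=" "):
    """Replace one column with numbered columns, one per separated piece."""
    result = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        for index, part in enumerate(row[column].split(separator), start=1):
            new_row[f"{column}_{index}"] = part
        result.append(new_row)
    return result


# Hypothetical row: genus and species combined in one cell.
rows = [{"ScientificName": "Dipodomys merriami", "sex": "F"}]
print(split_column(rows, "ScientificName"))
```

Splitting a combined cell this way gives each variable its own column, which is the first tidy data rule.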
![Page 16: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/16.jpg)
Undo/Redo
• All your steps are saved!
• Click where it says Undo / Redo
• Left frame
• Click on the step to revert to
• Result: the data revert to that step.
![Page 17: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/17.jpg)
Saving Scripts
• Export the steps for reuse
• In the Undo / Redo section, click Extract
• Select the steps you want to keep
• Save code as .txt file using a text editor
![Page 18: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/18.jpg)
Applying Scripts
• Run the same steps on a similar document
• Click Apply
• Paste in the code
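The idea behind saving and applying scripts is that cleaning steps are stored as data and replayed in order on another file. OpenRefine extracts its history as JSON; the Python sketch below does not reproduce that format. It is only an analogy, and every function name, column name, and value in it is hypothetical.

```python
def trim(rows, column):
    """Strip whitespace from one column."""
    return [{**row, column: row[column].strip()} for row in rows]


def replace_value(rows, column, old, new):
    """Replace one exact value in one column."""
    return [{**row, column: new if row[column] == old else row[column]} for row in rows]


# The saved "script": an ordered list of steps and their settings.
STEPS = [
    (trim, {"column": "ScientificName"}),
    (replace_value, {
        "column": "ScientificName",
        "old": "Dipodmys merriami",
        "new": "Dipodomys merriami",
    }),
]


def apply_steps(rows, steps):
    """Run every saved step, in order, on a new set of rows."""
    for step, settings in steps:
        rows = step(rows, **settings)
    return rows


# A second, similar file gets exactly the same treatment.
other_file = [{"ScientificName": " Dipodmys merriami "}]
print(apply_steps(other_file, STEPS))
```

Because the steps refer to columns by name, the second document needs the same column names for the replay to work.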
![Page 19: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/19.jpg)
Saving and exporting a project
• Autosave feature
• Click 'Export' button (top right)
• Select 'Export project'
• Result: a compressed file that contains
• Data
• Cleaning steps
![Page 20: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/20.jpg)
Importing a project
• Found in the menu where you create/open projects
• Loads data and history
![Page 21: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/21.jpg)
Exporting data
• Go to 'Export' in the top right.
• Click on the file type you want to export the data in.
• 'Tab-separated values'
• 'Comma-separated values'
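The two formats differ only in the delimiter between cells. A small Python sketch using the standard `csv` module shows the difference; the column names and values are hypothetical.

```python
import csv
import io


def export_rows(rows, delimiter=","):
    """Serialize rows as delimited text: ',' for CSV, a tab for TSV."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0].keys()),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical cleaned rows.
rows = [{"species": "Dipodomys merriami", "sex": "F"}]
print(export_rows(rows))        # comma-separated
print(export_rows(rows, "\t"))  # tab-separated
```

Tab-separated output is often the safer choice when cells themselves contain commas.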
![Page 22: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/22.jpg)
Subsetting data 2 ways: Facet
• Facet the species column
• Click on a facet
![Page 23: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/23.jpg)
Subsetting 2 ways: Text filter
• Example: Find all records collected in Hawaii
• Unstructured data: many facets contain “Hawaii”
• Text filter = “Hawaii”
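The difference between the two approaches is exact match versus substring match. A minimal Python sketch, assuming a hypothetical `locality` column and made-up values; the case-insensitive matching here is a choice made for this example.

```python
def facet_select(rows, column, value):
    """Facet-style subset: keep rows whose cell equals the chosen value exactly."""
    return [row for row in rows if row[column] == value]


def text_filter(rows, column, query):
    """Filter-style subset: keep rows whose cell contains the query, ignoring case."""
    needle = query.lower()
    return [row for row in rows if needle in row[column].lower()]


# Hypothetical unstructured locality values.
rows = [
    {"locality": "Hawaii, Maui"},
    {"locality": "Honolulu, Hawaii"},
    {"locality": "Arizona"},
]

print(facet_select(rows, "locality", "Hawaii"))  # no exact match
print(text_filter(rows, "locality", "Hawaii"))   # both Hawaii records
```

When the location is buried inside longer free text, no single facet value captures every record, so the text filter is the better tool.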
![Page 24: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/24.jpg)
Reshaping data
Wide format = not tidy
• Both rows and columns are variables
• Column headers are values, not variable names
Tall format = Tidy
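Going from wide to tall means turning each column header into a value in its own column. A short Python sketch of the idea, with hypothetical column names and weights:

```python
def wide_to_tall(rows, id_column, key_name, value_name):
    """Turn every non-id column into its own row of (id, header, value)."""
    tall = []
    for row in rows:
        for header, value in row.items():
            if header == id_column:
                continue
            tall.append({id_column: row[id_column], key_name: header, value_name: value})
    return tall


# Hypothetical wide table: one row per day, one column per mouse.
wide = [
    {"day": 1, "mouse_1": 20.1, "mouse_2": 19.8},
    {"day": 2, "mouse_1": 20.3, "mouse_2": 19.9},
]

tall = wide_to_tall(wide, "day", "mouse", "weight")
for row in tall:
    print(row)
```

In the tall result each row holds one measured value, and the former headers have become values of a `mouse` variable.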
![Page 25: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/25.jpg)
Reshaping Lou’s data
Lou is a first-year graduate student working on a project in a biomedical research laboratory. He’s trying to decipher data left by a former postdoc as a start for his thesis project. For one year, the postdoc recorded weight daily and cytokine levels monthly from 16 mice. Half were infected with a parasite, half were treated with saline.
• Variables for weight: Date, mouse number, infection status, value
• Variables for cytokine levels: Date, mouse number, infection status, value
![Page 26: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/26.jpg)
Weight data format
• Mouse # across the top
• Days as rows
• Month in file name
• Infection status in secondary column header
![Page 27: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/27.jpg)
Tidying the mouse weight data
• Download data: http://tinyurl.com/hvna4mg
• Import April_weight.xls
• Transpose cells across columns into rows
• Split the mouse column on a space (“ ”)
• Edit/delete columns
• Export script, use on other spreadsheets
![Page 28: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/28.jpg)
Import: ignore first line
![Page 29: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/29.jpg)
Transpose cells across columns to rows
![Page 30: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/30.jpg)
Split the mouse_number column
• Separate by space
• Result: 3 new columns
![Page 31: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/31.jpg)
Delete/Change Column Names
![Page 32: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/32.jpg)
Facet Treatment Column
![Page 33: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/33.jpg)
Edit Facets
![Page 34: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/34.jpg)
Tidy Data!!
Variables:
• Days
• Mouse_number
• Treatment
• Month in file name
Rows: weight from one mouse on one day
Can add month column and merge with other files in R! (next time?)
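The slide suggests doing this step in R. As an illustration of the same idea, here is a Python sketch that pulls the month out of a file name such as April_weight.xls and stacks several tidy files. The row contents and the second file name are hypothetical, and it assumes the month is the text before the first underscore.

```python
from pathlib import Path


def add_month(rows, filename):
    """Add a 'month' column taken from the file name, e.g. 'April_weight.xls'."""
    month = Path(filename).stem.split("_")[0]
    return [{**row, "month": month} for row in rows]


def combine(files):
    """Stack tidy rows from several files, tagging each row with its month."""
    merged = []
    for filename, rows in files.items():
        merged.extend(add_month(rows, filename))
    return merged


# Hypothetical tidy rows from two monthly files.
files = {
    "April_weight.xls": [{"day": 1, "mouse_number": 1, "treatment": "infected", "weight": 20.1}],
    "May_weight.xls": [{"day": 1, "mouse_number": 1, "treatment": "infected", "weight": 21.0}],
}

for row in combine(files):
    print(row)
```

Moving the month from the file name into a column is what allows the monthly files to be merged into one table without losing information.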
![Page 35: Data and Donuts: Data cleaning with OpenRefine](https://reader035.vdocuments.site/reader035/viewer/2022062522/58738b501a28ab272d8b6a93/html5/thumbnails/35.jpg)
Need help?
• Email: [email protected]
• Data Management Services website: http://lib.colostate.edu/services/data-management
• Data Carpentry: http://www.datacarpentry.org/
• OpenRefine Lesson: http://www.datacarpentry.org/OpenRefine-ecology-lesson/