dplyr: manipulating your data
Washington University in St. Louis
September 14, 2016
(Washington University in St. Louis) dplyr: manipulating your data September 14, 2016 1 / 44
1 Overview
Data manipulation as a part of the data analysis pipeline
dplyr: why it’s awesome
dplyr: how do we use it?
Some resources for dplyr
Hadley Wickham’s online tutorials: http://www.r-bloggers.com/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/
Vignettes: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
RStudio Blog: http://blog.rstudio.org/2014/01/17/introducing-dplyr/
NOTE: Today’s presentation on dplyr is heavily based on materials from Hadley Wickham’s 2014 tutorial! If you would like more in-depth resources about it, I highly recommend going there first. (In other words, I take no credit for this presentation – all credits to RStudio and Hadley Wickham for creating an awesome tutorial on dplyr)
the circle of data processing life
what we will cover in dplyr
World Happiness Data (aka Homework 1.csv)
Single table verbs & grouped summaries
Data pipelines
Joins (two table verbs)
Do
some bad news and some good news
Time for Some RStudio!
Now you’ll want to open up RStudio & read in the "Homework 1.csv" dataset.
Installing and Loading dplyr
If you want to install the package via the command line:
> install.packages("dplyr")
Remember that you’ll also want to load the package:
> library(dplyr)
World Happiness data
1 Happiness: subjective happiness
2 GDP: Log gross domestic product per capita
3 Support: subjective support from friends
4 Life: healthy life expectancy at birth
5 Freedom: satisfied or dissatisfied with freedom
6 Generosity: donated to charity in past month
7 Corruption: corruption widespread?
Load all of the data by importing the "Homework 1.csv" file.
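One way to do the import from the console is base R's read.csv(). This is only a sketch: it assumes "Homework 1.csv" sits in your working directory, and the object name data is chosen to match the name used in the later exercise slides.

```r
# Assumption: "Homework 1.csv" is in the current working directory (see getwd()).
csv_path <- "Homework 1.csv"

if (file.exists(csv_path)) {
  data <- read.csv(csv_path)  # read the file into a data frame
  str(data)                   # check the column names and types
} else {
  message("File not found; point R at the right folder with setwd() first.")
}
```

RStudio's "Import Dataset" button should give an equivalent result if you prefer the menus.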
The 5 most important verbs in dplyr
1 filter: keep rows matching criteria
2 select: pick columns by name
3 arrange: reorder rows
4 mutate: add new variables
5 summarise: reduce variables to values
Structure
1 First argument is a data frame
2 Subsequent arguments say what to do with data frame
3 Always return a data frame
4 (Never modify in place; you’ll want to assign the output data frame to an object)
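The four points above can be seen in a small sketch. The data frame toy and its columns are hypothetical, made up only for this illustration.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration.
toy <- data.frame(id = 1:4, score = c(10, 20, 30, 40))

# The data frame is the first argument; the condition says what to do with it.
# The call returns a new data frame...
filter(toy, score > 20)

# ...but `toy` itself is unchanged: it still has all 4 rows.
nrow(toy)

# To keep the result, assign it to an object.
high <- filter(toy, score > 20)
nrow(high)
```

Because nothing is modified in place, forgetting the assignment is harmless, but the filtered result is lost once it has been printed.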
A simple example
df <- data.frame(
color = c("blue","black","blue","blue","black"),
value = 1:5)
Filter the rows that are blue
filter(df, color == "blue")
Filter based on certain values
filter(df, value %in% c(1,4))
Some more boolean operators
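As a rough sketch of how R's standard comparison and logical operators combine inside filter(), here they are applied to the toy df from the simple example (recreated so the snippet runs on its own):

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5)

filter(df, value > 2)                      # greater than
filter(df, value >= 2 & value <= 4)        # AND: both conditions hold
filter(df, color == "black" | value == 1)  # OR: at least one condition holds
filter(df, color != "blue")                # not equal
filter(df, !(value %in% c(1, 4)))          # NOT: negate a condition
```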
Data Exercise: Find all countries...
1 That begin with J (Japan and Jordan)
2 Classified as World 1
3 With Life between 60 and 70
4 That are both World 1 and Life is between 60 and 70
5 Where Corruption was less than Generosity
Data Exercise: Find all countries...
1 J <- filter(data, Country == "Japan" | Country == "Jordan")
2 W1 <- filter(data, World == 1)
3 life <- filter(data, Life > 60 & Life < 70)
4 lifeW1 <- filter(data, Life > 60 & Life < 70 & World == 1)
5 cg <- filter(data, Corruption < Generosity)
Select
With the "select()" function, you can pick the variables that you are most interested in. For example, you can treat names of variables like positions.
select(df, color)
select(df, -color)
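To make "names like positions" concrete, here is a small sketch on the toy df (recreated so the snippet runs on its own). It shows that select() accepts a column position and a name-to-name range as well as a bare name:

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5)

select(df, color)        # keep only `color`
select(df, -color)       # drop `color`, keep everything else
select(df, 1)            # by position: the first column (`color`)
select(df, color:value)  # a range of columns, from `color` through `value`
```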
Your Turn
Read the help for select(). What other ways can you select variables?
Write down (in R) three ways that you can select the two delay variables in your flight data.
5 ways to select your data
Many ways, same delays
select(data, c(Happiness, GDP, Support))
select(data, starts_with("G"))
select(data, contains("G"))
Arrange
The purpose of the "arrange()" function is to change the order of your rows.
arrange(df, color)
arrange(df, desc(color))
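A short sketch of what these calls do on the toy df (recreated so the snippet runs on its own). The third call is an extra illustration, not from the slides, showing how a second column breaks ties:

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5)

arrange(df, color)               # "black" rows first, then "blue"
arrange(df, desc(color))         # "blue" rows first
arrange(df, color, desc(value))  # within each color, largest value first
```

Note that desc() wraps a single column, so each column that should be sorted in descending order needs its own desc() call.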
Your Turn
Order the dataset by Happiness and GDP. Which countries were happiest?
If we switch the order to GDP and Happiness, what countries are least happy?
If we order by descending Happiness and GDP, what happens?
Arrange Away
arrange(data, Happiness, GDP)
arrange(data, GDP, Happiness)
arrange(data, desc(Happiness), desc(GDP))
Mutate
"mutate()" will allow you to add new variables as a function of existing variables.
mutate(df, double = 2 * value)
Mutate with compound statements
mutate() will also allow you to perform additional transformations on newly created variables. Neat!
mutate(df, double = 2 * value, quadruple = 2 * double)
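A minimal sketch of the compound form on the toy df (recreated so the snippet runs on its own). The result is assigned to a new object, here arbitrarily named df2, so the new columns can be inspected:

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5)

# `quadruple` can use `double` because columns are created left to right
# within a single mutate() call.
df2 <- mutate(df, double = 2 * value, quadruple = 2 * double)
df2
```

The original df still has only its two columns; the new variables live in df2.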
Your Turn
Reverse score the Corruption variable (like in the homework) and add it to your data frame.
Standardize your new corruption variable and add that to your data frame.
(Hint: you may need to use select() or View() to see your new variable.)
Mutate
data <- mutate(data, Corruption_r = Corruption * -1)
arrange(data, desc(Corruption_r))
data <- mutate(data, Corruption_z =
  scale(Corruption_r, center = TRUE, scale = TRUE))
Summarise
"summarise()" will give you a 1-row data frame. This is not particularly useful.
summarise(df, total = sum(value))
Group, then summarise
It is much more useful to group your data and then summarise it.
by_color <- group_by(df, color)
summarise(by_color, total = sum(value))
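Worked through on the toy df (recreated so the snippet runs on its own), the grouped summary gives one row per color. The object name totals is just a label chosen for this sketch.

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5)

by_color <- group_by(df, color)
totals <- summarise(by_color, total = sum(value))
totals
# black: 2 + 5 = 7
# blue:  1 + 3 + 4 = 8
```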
Grouping the World Happiness data
by_world <- group_by(data, World)
Summary functions
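As a sketch of the kind of function that works inside summarise(), here are a few that reduce a column to a single value per group, applied to the toy df (recreated so the snippet runs on its own). The output column names are arbitrary.

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5)

by_color <- group_by(df, color)

stats <- summarise(by_color,
  n        = n(),          # number of rows in the group (dplyr helper)
  smallest = min(value),
  largest  = max(value),
  average  = mean(value),
  spread   = sd(value))
stats
```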
Your Turn!
Now that you understand the group_by() function and summarise() function, how might we want to summarise the GDP by World? (There are probably many ways to do this).
What is the average and standard deviation of GDP when you group by World?
Group by World and summarise GDP
by_world <- group_by(data, World)
GDP_by_World <- summarise(by_world, mean = mean(GDP, na.rm = TRUE),
  sd = sd(GDP, na.rm = TRUE))
Data pipelines
In real data manipulation, you’re probably not going to just use one verb, but you’re going to use multiple verbs at the same time. This is where you’ll want to use data pipelines, which link a bunch of functions into readable code. So instead of this...
...you can have this!
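A sketch of the contrast on the toy df (recreated so the snippet runs on its own), assuming the %>% pipe that dplyr re-exports from magrittr. The particular chain of verbs is made up for illustration.

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5)

# Nested calls: you have to read from the inside out.
nested <- arrange(
  summarise(group_by(filter(df, value > 2), color), total = sum(value)),
  desc(total))

# Piped: each step reads top to bottom, and the result of one line
# becomes the first argument of the next.
piped <- df %>%
  filter(value > 2) %>%
  group_by(color) %>%
  summarise(total = sum(value)) %>%
  arrange(desc(total))
```

Both objects hold the same two-row result; only the readability differs.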
Joining datasets
Sometimes, you will want to join two separate datasets. Like in the example below, where you want to join two data frames.
Create two dataframes
x <- data.frame(
name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
instrument = c("guitar", "bass", "guitar", "drums", "bass",
"drums"))
y <- data.frame(
name = c("John", "Paul", "George", "Ringo", "Brian"),
band = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE"))
inner_join()
Include only rows in both x and y.
left_join()
Include all of x, and matching rows in y.
semi_join()
Include only rows of x that match y
anti_join()
Include only rows of x that DON’T match y
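The four joins can be compared side by side on the two data frames x and y (recreated so the snippet runs on its own). The by = "name" argument is spelled out here for clarity. The result names inner, left, semi and anti are just labels for this sketch.

```r
library(dplyr)

x <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
  instrument = c("guitar", "bass", "guitar", "drums", "bass", "drums"))
y <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Brian"),
  band = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE"))

inner <- inner_join(x, y, by = "name")  # 4 rows: John, Paul, George, Ringo
left  <- left_join(x, y, by = "name")   # 6 rows: band is NA for Stuart and Pete
semi  <- semi_join(x, y, by = "name")   # 4 rows, but only x's columns
anti  <- anti_join(x, y, by = "name")   # 2 rows: Stuart and Pete
```

Note that semi_join() and anti_join() only filter x; they never add columns from y.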
Summary of all the join functions
Do function
In the case where none of these functions can do what you want to do to manipulate the data, you can always use the do() function. It is slower, but more general purpose, and is similar to ddply() and dlply(), if you have used those functions.
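A minimal sketch of do() on the toy df (recreated so the snippet runs on its own). Inside do(), the dot stands for the rows of the current group, so any function that returns a data frame can be applied group by group. The task chosen here, keeping the row with the largest value per color, is only an illustration.

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5)

# For each color, sort that group's rows by value (descending)
# and keep the first one.
top_per_color <- df %>%
  group_by(color) %>%
  do(head(arrange(., desc(value)), 1))
top_per_color
```

Recent dplyr releases mark do() as superseded, so it may be worth checking the current documentation for the recommended alternatives before relying on it in new code.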