JOINING DATA IN R WITH DPLYR Welcome to the course!

Upload: vonga

Post on 06-Feb-2018



  Var_1 Var_2 Var_3 Var_4
  obs_1    33     3    54
  obs_2    20    90    22
  obs_3    58    12    15
  obs_4    83    81     5

> mean(df$Var_2)
[1] 48.5


> df$Var_5 <- df$Var_2 + df$Var_4

  Var_1 Var_2 Var_3 Var_4 Var_5
  obs_1    33     3    54    87
  obs_2    20    90    22    42
  obs_3    58    12    15    73
  obs_4    83    81     5    88
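The two operations on these slides can be reproduced with a small sketch in base R. It assumes that `Var_1` holds the observation labels, which is consistent with the mean of 48.5 and the `Var_5` values shown; the name `mean_var_2` is only illustrative.

```r
# Rebuild the slide's table; Var_1 is assumed to hold the observation labels.
df <- data.frame(
  Var_1 = c("obs_1", "obs_2", "obs_3", "obs_4"),
  Var_2 = c(33, 20, 58, 83),
  Var_3 = c(3, 90, 12, 81),
  Var_4 = c(54, 22, 15, 5),
  stringsAsFactors = FALSE
)

# Summarise one column, as on the slide.
mean_var_2 <- mean(df$Var_2)

# Derive a new column from two existing columns of the same table.
df$Var_5 <- df$Var_2 + df$Var_4

print(mean_var_2)
print(df)
```

Both steps work because every value involved already sits in one data frame; when the columns live in separate tables, the tables have to be joined first.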


Course outline

● Chapter 1 - Mutating joins

● Chapter 2 - Filtering joins and set operations

● Chapter 3 - Assembling data

● Chapter 4 - Advanced joining

● Chapter 5 - Case study


● arrange()

● filter()

● select()

● mutate()

● summarise()
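As a reminder of what these five verbs do, here is a minimal sketch on the built-in `pressure` data set (the same one used on a later slide). The threshold of 300 and the names `result` and `overall` are illustrative choices, not part of the course material.

```r
library(dplyr)

# `pressure` ships with R (datasets package): columns temperature and pressure.
result <- pressure %>%
  filter(temperature >= 300) %>%                   # keep a subset of rows
  mutate(fahrenheit = temperature * 1.8 + 32) %>%  # add a derived column
  select(fahrenheit, pressure) %>%                 # keep a subset of columns
  arrange(desc(pressure))                          # reorder the rows

# Collapse the whole table to a single summary row.
overall <- summarise(pressure, mean_temperature = mean(temperature))

print(result)
print(overall)
```

Each verb takes one table and returns one table; the join functions covered in this course extend the same idea to two tables.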


merge()


Benefits of dplyr join functions

● Always preserve row order

● Intuitive syntax

● Can be applied to databases, Spark, etc.
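The row-order point can be illustrated with a short sketch that compares base R's `merge()` with `left_join()`. It uses the small band tables that appear later in these slides; the result names `merged` and `joined` are illustrative. By default `merge()` sorts its result on the key columns (`sort = TRUE`), whereas `left_join()` keeps the rows in the order of the first table.

```r
library(dplyr)

names <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
plays <- data.frame(
  name  = c("John", "Paul", "Keith"),
  plays = c("Guitar", "Bass", "Guitar"),
  stringsAsFactors = FALSE
)

# Base R: all.x = TRUE keeps every row of `names`; rows come back sorted by key.
merged <- merge(names, plays, by = "name", all.x = TRUE)

# dplyr: same rows, but in the original order of `names`.
joined <- left_join(names, plays, by = "name")

print(merged$name)
print(joined$name)
```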


Let’s practice!


Keys


> names
  name    band
1 Mick  Stones
2 John Beatles
3 Paul Beatles

> plays
   name  plays
1  John Guitar
2  Paul   Bass
3 Keith Guitar

# Example join output
   name    band  plays
1  Mick  Stones   <NA>
2  John Beatles Guitar
3  Paul Beatles   Bass
4 Keith    <NA> Guitar


Keys: the name column is the primary key in names and the foreign key in plays.


> names2
  name   surname    band
1 John  Coltrane      NA
2 John    Lennon Beatles
3 Paul McCartney Beatles

> plays2
   name   surname  plays
1  John    Lennon Guitar
2  Paul McCartney   Bass
3 Keith  Richards Guitar

# Example join output
   name   surname    band  plays
1  John  Coltrane    <NA>   <NA>
2  John    Lennon Beatles Guitar
3  Paul McCartney Beatles   Bass
4 Keith  Richards    <NA> Guitar

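Keys: here name and surname together form the key (primary key in names2, foreign key in plays2), because name alone does not identify a row.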
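One way to check whether a set of columns can serve as a key is to test it for duplicates. The sketch below does this in base R for the `names2` table; `anyDuplicated()` returns 0 when there are no repeats and otherwise the index of the first repeated entry. The variable names `dup_name_only` and `dup_name_surname` are illustrative.

```r
names2 <- data.frame(
  name    = c("John", "John", "Paul"),
  surname = c("Coltrane", "Lennon", "McCartney"),
  band    = c(NA, "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)

# `name` alone repeats ("John" appears twice), so it cannot identify a row.
dup_name_only <- anyDuplicated(names2$name)

# The combination of `name` and `surname` is unique for every row.
dup_name_surname <- anyDuplicated(names2[, c("name", "surname")])

print(dup_name_only)
print(dup_name_surname)
```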


Let’s practice!


Joins


left_join()

> left_join(names, plays, by = "name")

Arguments, in order: the table to augment (names), the table to augment with (plays), and the key column name(s) as a character string (by).

> names
  name    band
1 Mick  Stones
2 John Beatles
3 Paul Beatles

> plays
   name  plays
1  John Guitar
2  Paul   Bass
3 Keith Guitar

Result:

  name    band  plays
1 Mick  Stones   <NA>
2 John Beatles Guitar
3 Paul Beatles   Bass

The result keeps the rows from the first table and adds values from the second table.
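The left join on this slide can be run end to end with the sketch below. The two data frames are rebuilt by hand from the slide output; `result` is an illustrative name.

```r
library(dplyr)

names <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
plays <- data.frame(
  name  = c("John", "Paul", "Keith"),
  plays = c("Guitar", "Bass", "Guitar"),
  stringsAsFactors = FALSE
)

# Keep every row of `names`; look up `plays` by the shared key column.
result <- left_join(names, plays, by = "name")
print(result)
```

Mick has no match in `plays`, so his `plays` value is filled with NA; Keith appears only in `plays`, so he is absent from the result.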


Multi-column keys

> names2
  name   surname    band
1 John  Coltrane      NA
2 John    Lennon Beatles
3 Paul McCartney Beatles

> plays2
   name   surname  plays
1  John    Lennon Guitar
2  Paul McCartney   Bass
3 Keith  Richards Guitar

> left_join(names2, plays2, by = c("name", "surname"))
  name   surname    band  plays
1 John  Coltrane    <NA>   <NA>
2 John    Lennon Beatles Guitar
3 Paul McCartney Beatles   Bass

The by argument takes a vector of key column name(s).
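To see why both columns are needed in the key, the sketch below contrasts a join on `name` alone with the join on `name` and `surname`. The tables are rebuilt from the slide; `by_name` and `by_both` are illustrative names.

```r
library(dplyr)

names2 <- data.frame(
  name    = c("John", "John", "Paul"),
  surname = c("Coltrane", "Lennon", "McCartney"),
  band    = c(NA, "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
plays2 <- data.frame(
  name    = c("John", "Paul", "Keith"),
  surname = c("Lennon", "McCartney", "Richards"),
  plays   = c("Guitar", "Bass", "Guitar"),
  stringsAsFactors = FALSE
)

# Joining on `name` alone matches John Coltrane to John Lennon's row and
# leaves two surname columns, suffixed .x and .y.
by_name <- left_join(names2, plays2, by = "name")

# Joining on both columns matches only rows where name AND surname agree.
by_both <- left_join(names2, plays2, by = c("name", "surname"))

print(by_name)
print(by_both)
```

With the single-column key, John Coltrane is wrongly credited with Guitar; with the two-column key his `plays` value is NA, as on the slide.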


right_join()

> names
  name    band
1 Mick  Stones
2 John Beatles
3 Paul Beatles

> plays
   name  plays
1  John Guitar
2  Paul   Bass
3 Keith Guitar

> right_join(names, plays, by = "name")
   name    band  plays
1  John Beatles Guitar
2  Paul Beatles   Bass
3 Keith    <NA> Guitar

The result keeps the rows from the second table and adds values from the first table.
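A right join keeps the same people as a left join with the two tables swapped; only the column order differs. The following sketch checks this on the slide's tables (`right` and `swapped` are illustrative names).

```r
library(dplyr)

names <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
plays <- data.frame(
  name  = c("John", "Paul", "Keith"),
  plays = c("Guitar", "Bass", "Guitar"),
  stringsAsFactors = FALSE
)

# Rows come from `plays` (the second table); `band` is looked up in `names`.
right <- right_join(names, plays, by = "name")

# Same rows, obtained by making `plays` the first table of a left join.
swapped <- left_join(plays, names, by = "name")

print(right)
print(swapped)
```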

"tables"

● data frames

● tibbles (tbl_df)

● tbl references


tibble vs. data frame

> # A data frame
> mtcars
# The entire data frame prints, so only the last values of the last columns
# (wrapped around to appear below the first columns) stay visible in the window:
Camaro Z28        3    4
Pontiac Firebird  3    2
Fiat X1-9         4    1
Porsche 914-2     5    2
Lotus Europa      5    2
Ford Pantera L    5    4
Ferrari Dino      5    6
Maserati Bora     5    8
Volvo 142E        4    2

> library(tibble)
> as_tibble(mtcars)
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
*  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4
# ... with 22 more rows, and 1 more variables: carb <dbl>
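The conversion on this slide can be tried with the short sketch below, assuming the tibble package is installed. The name `mtcars_tbl` is illustrative, and the exact printed layout depends on the installed tibble version, so no output is reproduced here.

```r
library(tibble)

# Convert the built-in data frame to a tibble.
mtcars_tbl <- as_tibble(mtcars)

# A tibble is still a data frame, with extra classes that change how it prints.
print(class(mtcars_tbl))

# Printing shows only the first rows and the columns that fit the console.
print(mtcars_tbl)
```

Because a tibble inherits from `data.frame`, it can be passed to the dplyr join functions just like an ordinary data frame.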



Let’s practice!


Mutating joins


mutate()

> pressure[1:4, ]
  temperature pressure
1           0   0.0002
2          20   0.0012
3          40   0.0060
4          60   0.0300

> mutate(pressure[1:4, ], fahrenheit = temperature * 1.8 + 32)
  temperature pressure fahrenheit
1           0   0.0002         32
2          20   0.0012         68
3          40   0.0060        104
4          60   0.0300        140


left_join()

> names
  name    band
1 Mick  Stones
2 John Beatles
3 Paul Beatles

> plays
   name  plays
1  John Guitar
2  Paul   Bass
3 Keith Guitar

> left_join(names, plays, by = "name")
  name    band  plays
1 Mick  Stones   <NA>
2 John Beatles Guitar
3 Paul Beatles   Bass


right_join()

> names
  name    band
1 Mick  Stones
2 John Beatles
3 Paul Beatles

> plays
   name  plays
1  John Guitar
2  Paul   Bass
3 Keith Guitar

> right_join(names, plays, by = "name")
   name    band  plays
1  John Beatles Guitar
2  Paul Beatles   Bass
3 Keith    <NA> Guitar


inner_join()

> names
  name    band
1 Mick  Stones
2 John Beatles
3 Paul Beatles

> plays
   name  plays
1  John Guitar
2  Paul   Bass
3 Keith Guitar

> inner_join(names, plays, by = "name")
  name    band  plays
1 John Beatles Guitar
2 Paul Beatles   Bass


full_join()

> names
  name    band
1 Mick  Stones
2 John Beatles
3 Paul Beatles

> plays
   name  plays
1  John Guitar
2  Paul   Bass
3 Keith Guitar

> full_join(names, plays, by = "name")
   name    band  plays
1  Mick  Stones   <NA>
2  John Beatles Guitar
3  Paul Beatles   Bass
4 Keith    <NA> Guitar


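Syntax

> left_join( names, plays, by = "name")
> right_join(names, plays, by = "name")
> inner_join(names, plays, by = "name")
> full_join( names, plays, by = "name")

Arguments: x (first table), y (second table), by (key column names).

All four functions take the first table as their first argument, so they can be used with the pipe operator, %>%.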
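Because every join function takes the first table as its first argument, the piped and the direct call forms are interchangeable. A minimal sketch, with `direct` and `piped` as illustrative names:

```r
library(dplyr)

names <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
plays <- data.frame(
  name  = c("John", "Paul", "Keith"),
  plays = c("Guitar", "Bass", "Guitar"),
  stringsAsFactors = FALSE
)

# Direct call: x and y are passed positionally, by is named.
direct <- inner_join(names, plays, by = "name")

# Piped call: `names` is supplied as x by the pipe.
piped <- names %>% inner_join(plays, by = "name")

print(isTRUE(all.equal(direct, piped)))
```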


Pipe operator

> x <- 1:10

> sum(x)
[1] 55

> x %>% sum()
[1] 55

> abs(diff(range(x)))
[1] 9

> x %>%
+   range() %>%
+   diff() %>%
+   abs()
[1] 9


dplyr and pipes

> names
  name    band
1 Mick  Stones
2 John Beatles
3 Paul Beatles

> plays
   name  plays
1  John Guitar
2  Paul   Bass
3 Keith Guitar

> names %>%
+   full_join(plays, by = "name") %>%
+   mutate(missing_info = is.na(band) | is.na(plays)) %>%
+   filter(missing_info == TRUE) %>%
+   select(name, band, plays)
   name   band  plays
1  Mick Stones   <NA>
2 Keith   <NA> Guitar


Summary

● left_join()

● right_join()

● inner_join()

● full_join()
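The difference between the four mutating joins comes down to which rows are kept. The sketch below counts the rows each join returns for the two band tables used throughout these slides; the vector name `rows` is illustrative.

```r
library(dplyr)

names <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
plays <- data.frame(
  name  = c("John", "Paul", "Keith"),
  plays = c("Guitar", "Bass", "Guitar"),
  stringsAsFactors = FALSE
)

rows <- c(
  left  = nrow(left_join(names, plays, by = "name")),   # all rows of names
  right = nrow(right_join(names, plays, by = "name")),  # all rows of plays
  inner = nrow(inner_join(names, plays, by = "name")),  # only matched rows
  full  = nrow(full_join(names, plays, by = "name"))    # rows from either table
)
print(rows)
```

Only John and Paul appear in both tables, which is why the inner join returns two rows, the left and right joins three each, and the full join four.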


Let’s practice!