data.table tutorial uros2018r-project.ro/conference2018/presentations/data... · 3 developers: matt...

Jaap Walhout September 12, 2018 tutorial uRos 2018



Overview

1. Introduction
2. Fast read & write
3. Syntax
4. Basic operations (filtering rows & selecting columns)
5. Summarizing
6. Adding / updating variables
7. Joining datasets
8. Reshaping data

Special symbols: .N + .SD + .I
Special operator: :=


Introduction

Developers: Matt Dowle, Arun Srinivasan, Jan Gorecki, Michael Chirico, Pasha Stetsenko, Tom Short, Steve Lianoglou, Eduard Antonyan, Markus Bonsch, Hugh Parsonage

On CRAN since 2006, >35 releases so far

678 packages import/depend/suggest data.table (543 CRAN + 135 Bioconductor)

Homepage: http://r-datatable.com


Introduction

Why use data.table?

Pros:
- speed
- memory efficiency
- coding flexibility
- non-equi joins

Cons:
- 'different' syntax


Fast read & write

50 million rows / 10 columns / ± 4 GB (times in seconds)

fread("datafile.csv")

expr                  time
data.table_fread      15.6
readr_read_csv        92.6
base_read.csv        559.9

fwrite(DT, "datafile.csv")

expr                  time
data.table_fwrite     32.6
readr_write_csv      102.2
base_write.csv       201.9
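A minimal round trip with fwrite and fread may help to try the two functions without the 4 GB benchmark file. The small table and the temporary file path below are assumptions made for illustration only; no timings are implied.

```r
library(data.table)

# Small stand-in table (hypothetical data, not the benchmark file)
DT <- data.table(id = 1:5, value = c(2.5, 3.1, 4.7, 1.2, 9.9))

csv_path <- tempfile(fileext = ".csv")  # hypothetical location
fwrite(DT, csv_path)                    # write the table to disk
DT_back <- fread(csv_path)              # read it back as a data.table

print(DT_back)
unlink(csv_path)                        # clean up the temporary file
```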


Syntax: data.table == enhanced data.frame

Three main enhancements:

1. Column names can be used as variables inside [...]
2. Because they are variables, we can use column names to calculate stuff inside [...]
3. An additional grouping argument: by


Syntax: data frame refresher

Columnar data structure: 2D – rows and columns

- subset rows: df[df$id == "01", ]
- select columns: df[, "val1"]
- subset rows & select columns: df[df$id == "01", "val1"]
- that's about it ...


Syntax: general form

DT[i, j, by]

i: Which rows?
- vector of row numbers
- logical vector
- another data.table

j: What to do?
- summarizing
- updating variable(s)
- adding variable(s)

by: Grouped by what?
- one or more columns
- on-the-fly grouping var(s)


Syntax: general form

DT[i, j, by]

data.table:   i       j                  by
SQL:          where   select | update    group by
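As a rough illustration of the SQL analogy, the sketch below uses all three arguments at once on the built-in iris data. The SQL in the comment is only an approximate equivalent; the table name and the quoting are illustrative.

```r
library(data.table)
irisDT <- as.data.table(iris)

# Approximate SQL analogue (table name and quoting are illustrative):
#   SELECT Species, AVG("Petal.Width") AS average
#   FROM iris
#   WHERE "Sepal.Length" < 5.3
#   GROUP BY Species
res <- irisDT[Sepal.Length < 5.3,              # i:  where
              .(average = mean(Petal.Width)),  # j:  select
              by = Species]                    # by: group by
print(res)
```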


Example data

Built-in iris dataset:

irisDT <- as.data.table(iris)


Filtering rows & selecting columns

syntax: DT[i, j, by]

subset rows:
irisDT[Species == "setosa", ]

select columns:
irisDT[, Petal.Width]
irisDT[, .(Petal.Width)]

subset rows & select columns:
irisDT[Species == "setosa", Petal.Width]
irisDT[Species == "setosa", .(Petal.Width)]
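The two column-selection forms return different kinds of objects: a bare column name in j gives a plain vector, while wrapping it in .() (an alias for list()) gives a data.table. A small sketch on the iris data, with variable names v and d chosen only for illustration:

```r
library(data.table)
irisDT <- as.data.table(iris)

v <- irisDT[Species == "setosa", Petal.Width]     # bare name: plain numeric vector
d <- irisDT[Species == "setosa", .(Petal.Width)]  # .(): one-column data.table

is.numeric(v)          # the vector form
is.data.table(d)       # the data.table form
length(v) == nrow(d)   # both hold the same rows
```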


Filtering rows & selecting columns

subset rows:
irisDT[between(Petal.Width, 1, 2)]
irisDT[Petal.Width %between% c(1, 2)]

select columns:
irisDT[, .(Species, Sepal.Length)]
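Both between() and %between% include the bounds by default, so a range filter from 1 to 2 should match an explicit >= / <= comparison. The check below is a sketch on the iris data; the names a, b and explicit are illustrative.

```r
library(data.table)
irisDT <- as.data.table(iris)

a        <- irisDT[between(Petal.Width, 1, 2)]
b        <- irisDT[Petal.Width %between% c(1, 2)]
explicit <- irisDT[Petal.Width >= 1 & Petal.Width <= 2]

# all.equal() compares the table contents
isTRUE(all.equal(a, b))
isTRUE(all.equal(a, explicit))
```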


Exercise 1

Open the file ex1.R

- subset rows: get only the rows with a day lower than or equal to 10
- select columns: select only the Month column and make sure you get a data.table back
- subset rows & select columns: get only the Wind & Temp columns for the rows with a day higher than 5 and lower than or equal to 10


Summarizing

1. Counts
2. Aggregating
3. Group by


Counts

syntax: DT[i, j, by]

count:
irisDT[Species == "setosa", .N]

count distinct:
irisDT[, uniqueN(Species)]
irisDT[Petal.Width < 0.9, uniqueN(Species)]
uniqueN(irisDT, by = "Species")
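To see what these counting idioms return on the built-in iris data, here is a small sketch. Note that uniqueN is an ordinary function and is called without a leading dot, unlike the special symbol .N. The variable names are illustrative.

```r
library(data.table)
irisDT <- as.data.table(iris)

n_setosa  <- irisDT[Species == "setosa", .N]               # rows in the subset: 50
n_species <- irisDT[, uniqueN(Species)]                    # distinct species: 3
n_small   <- irisDT[Petal.Width < 0.9, uniqueN(Species)]   # only setosa qualifies: 1
n_by      <- uniqueN(irisDT, by = "Species")               # function form: 3
```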


Aggregating

syntax: DT[i, j, by]

Simple aggregation:
irisDT[, .(count = .N, average = mean(Petal.Width))]

Including filtering:
irisDT[Petal.Width < 0.9, .(count = .N, average = mean(Petal.Width))]


Group by

syntax: DT[i, j, by]

irisDT[, .N, by = Species]

irisDT[, .(average = mean(Petal.Width)), by = Species]

irisDT[Sepal.Length < 5.3, .(average = mean(Petal.Width)), by = Species]

irisDT[, .(average = mean(Petal.Width)), by = .(Species, logi = Sepal.Length < 5.3)]


Group by

special symbol: .SD

SD = Subset of Data

- a data.table by itself
- holds the data of the current group as defined in by
- when there is no by, .SD applies to the whole data.table
- allows for calculations on multiple columns


Group by

special symbol: .SD

irisDT[, lapply(.SD, mean), by = Species]

irisDT[Sepal.Length < 5.3, lapply(.SD, mean), by = Species]


Group by

special symbol: .SD
special symbol: .SDcols

irisDT[, lapply(.SD, mean), by = Species, .SDcols = 1:2]

irisDT[, lapply(.SD, mean), by = Species, .SDcols = grep("Length", names(irisDT))]
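Besides column positions, .SDcols also accepts a character vector of column names, which can be easier to read. A sketch, where the vector len_cols is an illustrative name:

```r
library(data.table)
irisDT <- as.data.table(iris)

len_cols <- c("Sepal.Length", "Petal.Length")  # columns chosen for illustration

# .SD holds only the columns named in .SDcols, once per Species group
res <- irisDT[, lapply(.SD, mean), by = Species, .SDcols = len_cols]
print(res)
```

The result has one row per species and one mean per selected column.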


Order of execution

DT[i, j, by]
DT[1, 3, 2]

(first i, then by, then j)
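One way to observe this order is to combine a filter with a grouped count: the groups only ever see the rows that survived i. A sketch on the iris data, with illustrative variable names:

```r
library(data.table)
irisDT <- as.data.table(iris)

# 1. i keeps only rows with Sepal.Length < 5.3
# 2. by splits those remaining rows per Species
# 3. j (.N) is evaluated once per group
filtered_counts <- irisDT[Sepal.Length < 5.3, .N, by = Species]
all_counts      <- irisDT[, .N, by = Species]

# The grouped counts add up to the size of the filtered subset
sum(filtered_counts$N) == irisDT[Sepal.Length < 5.3, .N]
```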


Exercise 2

Open the file ex2.R

- Count the number of days per month
- Calculate the average Wind speed by month for only those days that have an ozone value
- Calculate the mean temperature for the odd and even days for each month


Updating, adding & deleting variables

special operator: :=

- updates a data.table in place (by reference)
- can be used to:
  o update existing column(s)
  o add new column(s)
  o delete column(s)
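Because := works by reference, a second name assigned with <- still points at the same table and sees the change, whereas copy() produces an independent table. The toy table and the names alias and snapshot below are assumptions for illustration.

```r
library(data.table)

DT <- data.table(x = 1:3)
alias    <- DT        # plain assignment: both names refer to the same table
snapshot <- copy(DT)  # copy() makes an independent table

DT[, y := x * 2L]     # add a column in place

alias_has_y    <- "y" %in% names(alias)     # TRUE: alias sees the new column
snapshot_has_y <- "y" %in% names(snapshot)  # FALSE: the copy is untouched

DT[, y := NULL]       # delete the column again, also in place
```

Using copy() before modifying is a common way to keep the original table unchanged.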


Updating variables

special operator: :=

irisDT[, Sepal.Length := Sepal.Length * 2]

irisDT[, `:=`(Sepal.Length = Sepal.Length * 2, Petal.Width = Petal.Width / 2)]


Updating variables by group

special operator: :=

irisDT[, Sepal.Length := Sepal.Length * uniqueN(Sepal.Width) / .N, by = Species]

irisDT[, `:=`(Sepal.Length = Sepal.Length * uniqueN(Sepal.Width),
              Petal.Width = Petal.Width / .N),
       by = Species]


Adding variables

special operator: :=
special symbol: .I

irisDT[, rownumber := .I]

irisDT[, Sepal.Area := Sepal.Length * Sepal.Width]

irisDT[, `:=`(Sepal.Area = Sepal.Length * Sepal.Width, Petal.Area = Petal.Length * Petal.Width)]


Adding variables by group

special operator: :=

irisDT[, Total.Sepal.Area := sum(Sepal.Area), by = Species]

irisDT[, `:=`(Total.Sepal.Area = sum(Sepal.Area),
              Total.Petal.Area = sum(Petal.Area)),
       by = Species]


Deleting variables

special operator: :=

irisDT[, Sepal.Length := NULL]

irisDT[, (1:4) := NULL]

irisDT[, grep("Length", names(irisDT)) := NULL]


Exercise 3

Open the file ex3.R

- Change the Wind column from miles per hour to kilometers per hour (1 mph = 1.6 km/h)
- Calculate a new chill variable (Wind * Temperature)
- Calculate the average chill by month and add that as a new variable
- Remove the Ozone and Solar.R columns


Joining datasets

- subset rows: DT[id == "01", ]
- select columns: DT[, val1]
- subset rows & select columns: DT[id == "01", val1]


Joining datasets

DT[i, j, by]

i: Which rows?
- vector of row numbers
- logical vector
- another data.table

j: What to do?
- summarizing
- updating variable(s)
- adding variable(s)

by: Grouped by what?
- one or more columns
- on-the-fly grouping var(s)


Joining datasets

Example data

irisDT <- copy(iris)
setDT(irisDT)

irisH <- data.table(Species = c("setosa", "versicolor", "virginica"),
                    Species.full = c("Iris setosa", "Iris versicolor", "Iris virginica"),
                    height = 1:3,
                    soil = c("mud", "rock", "sand"))


Joining datasets

syntax: DT[i, on, j, by]

irisDT[irisH, on = .(Species)]

irisDT[irisH, on = "Species"]

irisDT[irisH, on = .(Species = Spec, other_col)]


Joining datasets

syntax: DT[i, on, j, by]

irisDT[irisH, on = .(Species), Species.full := Species.full]

irisDT[irisH, on = .(Species), `:=`(Species.full = Species.full, height = height, soil = soil)]

irisDT[irisH, on = .(Species), `:=`(Species.full = i.Species.full, height = i.height, soil = i.soil)]
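The i. prefix matters most when both tables have a column with the same name: an unprefixed name in j then refers to the column of the table being updated, and i.name picks the one from the table in i. The two miniature tables below (obs and lookup) are hypothetical stand-ins for irisDT and irisH.

```r
library(data.table)

# Hypothetical miniature tables; 'height' exists in both on purpose
obs <- data.table(Species = c("setosa", "setosa", "virginica"),
                  height  = c(10, 11, 12))
lookup <- data.table(Species = c("setosa", "virginica"),
                     height  = c(1L, 3L),
                     soil    = c("mud", "sand"))

# Update join: add columns to obs by reference, taking values from lookup
obs[lookup, on = .(Species), `:=`(lookup_height = i.height, soil = i.soil)]
print(obs)
```

The original height column of obs stays as it was; only the two new columns are filled from lookup.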


Joining & chaining

syntax: DT[i, on, j, by]

Like %>% from the tidyverse, you can also chain data.table operations together:

irisDT[...][...][...]

irisDT[irisH, on = .(Species), Species.full := Species.full][, median(Sepal.Length), by = Species.full]


Exercise 4

Open the file ex4.R

- Use a join to add the month name from 'airmonths' to 'air'
- Use a join to add both the month name and the month abbreviation from 'airmonths' to 'air'
- Use a join to add the month name from 'airmonths' to 'air'; then use chaining to calculate the median Wind speed for each month name


Reshaping data

From wide to long:

irisMelted <- melt(irisDT, id.vars = "Species")

melt(data, id.vars, measure.vars, variable.name = "variable", value.name = "value", na.rm = FALSE, variable.factor = TRUE, value.factor = FALSE)

See also: ?melt


Reshaping data

From long to wide:

dcast(irisMelted, Species ~ variable)

dcast(data, formula, fun.aggregate = NULL, sep = "_", ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var = guess(data))

See also: ?dcast
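In the melted iris data every Species/variable combination holds 50 values, so dcast has to aggregate; when no function is supplied it falls back to counting with length. Passing an aggregate explicitly, as in the sketch below, makes the intent visible. The choice of mean and the name irisWide are assumptions for illustration.

```r
library(data.table)
irisDT <- as.data.table(iris)

# Wide to long: 150 rows x 4 measure columns become 600 rows
irisMelted <- melt(irisDT, id.vars = "Species")

# Long to wide, averaging the 50 values in each Species/variable cell
irisWide <- dcast(irisMelted, Species ~ variable,
                  fun.aggregate = mean, value.var = "value")
print(irisWide)
```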


What else is there to discover?

- more joins: non-equi joins + rolling joins
- more special symbols: .BY + .GRP
- special grouping functions: rowid + rleid
- set* functions: setkey + setorder + setcolorder + setnames + ...
- and even more: frank + shift + CJ + tstrsplit + ...


Want to learn more?

- Overview of getting started vignettes
- DataCamp's data.table course (paid)
- Stack Overflow [data.table] tag (>7700 questions)


Thank you for your attention!

The End