data.table tutorial uros2018r-project.ro/conference2018/presentations/data... · 3 developers: matt...

Jaap Walhout, September 12, 2018

tutorial, uRos 2018

1. Introduction

2. Fast read & write

3. Syntax

4. Basic operations (filtering rows & selecting columns)

5. Summarizing

6. Adding / updating variables

7. Joining datasets

8. Reshaping data

Special symbols: .N + .SD + .I

Special operator: :=

Overview

Developers: Matt Dowle, Arun Srinivasan, Jan Gorecki, Michael Chirico, Pasha Stetsenko, Tom Short, Steve Lianoglou, Eduard Antonyan, Markus Bonsch, Hugh Parsonage

On CRAN since 2006, >35 releases so far

678 packages import/depend/suggest data.table (543 CRAN + 135 Bioconductor)

Homepage: http://r-datatable.com

Introduction

Why use data.table?

Pros:
- speed
- memory efficiency
- coding flexibility
- non-equi joins

Cons:
- ‘different’ syntax

Introduction

50 million rows / 10 columns / ± 4 GB

fread("datafile.csv")

expr                 time
data.table_fread     15.6
readr_read_csv       92.6
base_read.csv       559.9

fwrite(DT, "datafile.csv")

expr                 time
data.table_fread     32.6
readr_read_csv      102.2
base_read.csv       201.9

times in seconds

Fast read & write
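A minimal round-trip sketch of the two functions. It uses a small made-up table and a temporary file rather than the benchmark data; the table contents and the names DT, DT2 and path are assumptions for illustration only.

```r
library(data.table)

# Small illustrative table (not the 50-million-row benchmark file)
DT <- data.table(id = 1:5,
                 value = c(0.5, 1.5, 2.5, 3.5, 4.5),
                 label = letters[1:5])

# Write to a temporary file so nothing is left in the working directory
path <- tempfile(fileext = ".csv")
fwrite(DT, path)

# Read it back; fread detects the separator and the column types itself
DT2 <- fread(path)

# Compare the table that was written with the table that was read
all.equal(DT, DT2)
```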

Three main enhancements:

1. Column names can be used as variables inside […]

2. Because they are variables, we can use column names to calculate stuff inside […]

3. An additional grouping argument: by

Syntax: data.table == enhanced data.frame
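The three enhancements can be sketched side by side with base R. The tiny table and its column names (id, val1) are hypothetical and chosen only to keep the example short.

```r
library(data.table)

df <- data.frame(id = c("01", "01", "02"), val1 = c(10, 20, 30))
DT <- as.data.table(df)

# 1. Column names are variables inside [ ]: no df$ prefix is needed
rows_df <- df[df$id == "01", ]
rows_dt <- DT[id == "01"]

# 2. Because they are variables, columns can be used in calculations
total <- DT[, sum(val1)]

# 3. The additional `by` argument groups the calculation
per_id <- DT[, .(total = sum(val1)), by = id]
```

Here rows_df and rows_dt select the same two rows, total is the sum over the whole column, and per_id holds one row per value of id.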

Columnar data structure: 2D – rows and columns

- subset rows: df[df$id == "01", ]

- select columns: df[, "val1"]

- subset rows & select columns: df[df$id == "01", "val1"]

- that’s about it …

Syntax: data.frame refresher

DT[i, j, by]

i: Which rows?
- vector of row numbers
- logical vector
- another data.table

j: What to do?
- summarizing
- updating variable(s)
- adding variable(s)

by: Grouped by what?
- one or more columns
- on-the-fly grouping var(s)

Syntax: general form

DT[i, j, by]

data.table:  i      j                by
SQL:         where  select | update  group by

Syntax: general form

Built-in iris dataset:

irisDT <- as.data.table(iris)

Example data

syntax: DT[i, j, by]

subset rows
irisDT[Species == "setosa", ]

select columns
irisDT[, Petal.Width]
irisDT[, .(Petal.Width)]

subset rows & select columns
irisDT[Species == "setosa", Petal.Width]
irisDT[Species == "setosa", .(Petal.Width)]

Filtering rows & selecting columns
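The two ways of selecting a column differ in what they return. A short sketch, where the names v and dt are illustrative:

```r
library(data.table)
irisDT <- as.data.table(iris)

# A bare column name in j returns a plain vector
v <- irisDT[Species == "setosa", Petal.Width]

# Wrapping the column in .() returns a data.table
dt <- irisDT[Species == "setosa", .(Petal.Width)]

is.numeric(v)          # TRUE
is.data.table(dt)      # TRUE
length(v) == nrow(dt)  # TRUE: same rows, different container
```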

subset rows
irisDT[between(Petal.Width, 1, 1)]
irisDT[Petal.Width %between% c(1, 2)]

select columns
irisDT[, .(Species, Sepal.Length)]

Filtering rows & selecting columns

Open the file ex1.R

subset rows: get only the rows with a day lower than or equal to 10

select columns: select only the Month column and make sure you get a data.table back

subset rows & select columns: get only the Wind & Temp columns for the rows with a day higher than 5 and lower than or equal to 10

Exercise 1

1. Counts

2. Aggregating

3. Group by

Summarizing

syntax: DT[i, j, by]

count
irisDT[Species == "setosa", .N]

count distinct
irisDT[, uniqueN(Species)]
irisDT[Petal.Width < 0.9, uniqueN(Species)]
uniqueN(irisDT, by = "Species")

Counts
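For readers coming from base R, the counting helpers can be checked against their base equivalents. A small sketch; the variable names are illustrative:

```r
library(data.table)
irisDT <- as.data.table(iris)

# .N is the number of rows that remain after filtering in i
n_setosa <- irisDT[Species == "setosa", .N]
n_setosa == nrow(iris[iris$Species == "setosa", ])   # TRUE

# uniqueN(x) counts distinct values, like length(unique(x))
n_species <- irisDT[, uniqueN(Species)]
n_species == length(unique(iris$Species))            # TRUE
```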

syntax: DT[i, j, by]

Simple aggregation:
irisDT[, .(count = .N, average = mean(Petal.Width))]

Including filtering:
irisDT[Petal.Width < 0.9, .(count = .N, average = mean(Petal.Width))]

Aggregating

syntax: DT[i, j, by]

irisDT[, .N, by = Species]

irisDT[, .(average = mean(Petal.Width)), by = Species]

irisDT[Sepal.Length < 5.3, .(average = mean(Petal.Width)), by = Species]

irisDT[, .(average = mean(Petal.Width)), by = .(Species, logi = Sepal.Length < 5.3)]

Group by

special symbol: .SD

SD = Subset of Data

- a data.table by itself

- holds the data of the current group as defined in by

- when no by is given, .SD applies to the whole data.table

- allows for calculations on multiple columns

Group by

special symbol: .SD

irisDT[, lapply(.SD, mean), by = Species]

irisDT[Sepal.Length < 5.3, lapply(.SD, mean), by = Species]

Group by

special symbol: .SD

special symbol: .SDcols

irisDT[, lapply(.SD, mean), by = Species, .SDcols = 1:2]

irisDT[, lapply(.SD, mean), by = Species, .SDcols = grep("Length", names(irisDT))]

Group by
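A short sketch of what .SD contains and how .SDcols narrows it. The helper names shape, len_cols and means are illustrative; selecting columns by name (value = TRUE) instead of by position is a variation on the slide, not the slide's own code.

```r
library(data.table)
irisDT <- as.data.table(iris)

# .SD is a data.table holding the current group's rows,
# without the grouping column itself
shape <- irisDT[, .(rows = nrow(.SD), cols = ncol(.SD)), by = Species]

# Restrict .SD to the columns whose name contains "Length"
len_cols <- grep("Length", names(irisDT), value = TRUE)
means <- irisDT[, lapply(.SD, mean), by = Species, .SDcols = len_cols]

names(means)  # "Species" "Sepal.Length" "Petal.Length"
```

Each group has 50 rows and 4 columns in .SD, because Species is used for grouping and is therefore left out.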

DT[i, j, by]
   1  3  2

i is evaluated first, then by, then j

Order of execution
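The order can be observed directly: rows are filtered before groups are formed, so a group with no remaining rows never reaches j. A minimal sketch using the iris data; the name res is illustrative.

```r
library(data.table)
irisDT <- as.data.table(iris)

# 1. i keeps only rows with Petal.Width < 0.9 (all of them are setosa)
# 2. by groups the remaining rows by Species
# 3. j counts the rows in each group
res <- irisDT[Petal.Width < 0.9, .N, by = Species]

res  # a single row: setosa, 50
```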

Open the file ex2.R

- Count the number of days per month

- Calculate the average Wind speed by month for only those days that have an ozone value

- Calculate the mean temperature for the odd and even days for each month

Exercise 2

special operator: :=

- updates a data.table in place (by reference)

- can be used to:

o update existing column(s)

o add new column(s)

o delete column(s)

Updating, adding & deleting variables
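Updating by reference has a practical consequence: a plain assignment does not create a second table, so copy() is needed for an independent one. A small sketch; the table and the names alias and snapshot are hypothetical.

```r
library(data.table)

DT <- data.table(x = 1:3)

alias    <- DT        # plain assignment: both names refer to the same table
snapshot <- copy(DT)  # copy() creates an independent table

DT[, y := x * 2L]     # add a column by reference

names(alias)     # "x" "y": the alias sees the new column
names(snapshot)  # "x"    : the copy is unaffected
```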

special operator: :=

irisDT[, Sepal.Length := Sepal.Length * 2]

irisDT[, `:=`(Sepal.Length = Sepal.Length * 2, Petal.Width = Petal.Width / 2)]

Updating variables

special operator: :=

irisDT[, Sepal.Length := Sepal.Length * uniqueN(Sepal.Width) / .N, by = Species]

irisDT[, `:=`(Sepal.Length = Sepal.Length * uniqueN(Sepal.Width),
              Petal.Width = Petal.Width / .N),
       by = Species]

Updating variables by group

special operator: :=

special symbol: .I

irisDT[, rownumber := .I]

irisDT[, Sepal.Area := Sepal.Length * Sepal.Width]

irisDT[, `:=`(Sepal.Area = Sepal.Length * Sepal.Width, Petal.Area = Petal.Length * Petal.Width)]

Adding variables

special operator: :=

irisDT[, Total.Sepal.Area := sum(Sepal.Area), by = Species]

irisDT[, `:=`(Total.Sepal.Area = sum(Sepal.Area),
              Total.Petal.Area = sum(Petal.Area)),
       by = Species]

Adding variables by group
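The difference between summarizing by group and adding a variable by group can be sketched as follows: the first returns one row per group, the second keeps every row and repeats the group value. The name summary_dt is illustrative.

```r
library(data.table)
irisDT <- as.data.table(iris)
irisDT[, Sepal.Area := Sepal.Length * Sepal.Width]

# Summarizing with .() returns a new table with one row per species
summary_dt <- irisDT[, .(Total.Sepal.Area = sum(Sepal.Area)), by = Species]

# Assigning with := keeps all rows; the group total is repeated within each species
irisDT[, Total.Sepal.Area := sum(Sepal.Area), by = Species]

nrow(summary_dt)  # 3
nrow(irisDT)      # 150
```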

special operator: :=

irisDT[, Sepal.Length := NULL]

irisDT[, (1:4) := NULL]

irisDT[, grep("Length", names(irisDT)) := NULL]

Deleting variables
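When the columns to delete are stored in a variable, that variable has to be wrapped in parentheses so that it is evaluated rather than taken as a column name. A sketch on a fresh copy of the data; the name drop_cols is hypothetical.

```r
library(data.table)
irisDT <- as.data.table(iris)

# Collect the column names to remove
drop_cols <- grep("Length", names(irisDT), value = TRUE)

# (drop_cols) is evaluated to the names it holds;
# without the parentheses, := would target a column called "drop_cols"
irisDT[, (drop_cols) := NULL]

names(irisDT)  # "Sepal.Width" "Petal.Width" "Species"
```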

Open the file ex3.R

- Change the Wind column from miles per hour to kilometers per hour (1 mph = 1.6 km/h)

- Calculate a new chill variable (Wind * Temperature)

- Calculate the average chill by month and add that as a new variable

- Remove the Ozone and Solar.R columns

Exercise 3

- subset rows: DT[id == "01", ]

- select columns: DT[, val1]

- subset rows & select columns: DT[id == "01", val1]

Joining datasets

DT[i, j, by]

i: Which rows?
- vector of row numbers
- logical vector
- another data.table

j: What to do?
- summarizing
- updating variable(s)
- adding variable(s)

by: Grouped by what?
- one or more columns
- on-the-fly grouping var(s)

Joining datasets

Example data

irisDT <- copy(iris)
setDT(irisDT)

irisH <- data.table(Species = c("setosa", "versicolor", "virginica"),
                    Species.full = c("Iris setosa", "Iris versicolor", "Iris virginica"),
                    height = 1:3,
                    soil = c("mud", "rock", "sand"))

Joining datasets

syntax: DT[i, on, j, by]

irisDT[irisH, on = .(Species)]

irisDT[irisH, on = "Species"]

irisDT[irisH, on = .(Species = Spec, other_col)]

Joining datasets

syntax: DT[i, on, j, by]

irisDT[irisH, on = .(Species), Species.full := Species.full]

irisDT[irisH, on = .(Species), `:=`(Species.full = Species.full, height = height, soil = soil)]

irisDT[irisH, on = .(Species), `:=`(Species.full = i.Species.full, height = i.height, soil = i.soil)]

Joining datasets
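A runnable sketch of the update join, using the example tables from the earlier slide (without the soil column, to keep it short). The i. prefix makes explicit that a value comes from the table given in i, here irisH.

```r
library(data.table)

irisDT <- as.data.table(iris)
irisH <- data.table(Species = c("setosa", "versicolor", "virginica"),
                    Species.full = c("Iris setosa", "Iris versicolor", "Iris virginica"),
                    height = 1:3)

# For every row of irisDT that matches irisH on Species,
# add the full name and the height by reference
irisDT[irisH, on = .(Species),
       `:=`(Species.full = i.Species.full, height = i.height)]

# irisDT keeps its 150 rows and gains two columns
irisDT[, .N, by = .(Species, Species.full, height)]
```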

syntax: DT[i, on, j, by]

like %>% from the tidyverse, you can also chain data.table operations together

irisDT[…][…][…]

irisDT[irisH, on = .(Species), Species.full := Species.full][, median(Sepal.Length), by = Species.full]

Joining & chaining
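Chaining works because each [ ] returns a data.table, including a := step. A sketch of a three-step chain without a join; the column name median_area and the name res are illustrative.

```r
library(data.table)
irisDT <- as.data.table(iris)

res <- irisDT[, Sepal.Area := Sepal.Length * Sepal.Width   # step 1: add a column
              ][, .(median_area = median(Sepal.Area)), by = Species  # step 2: summarize
              ][order(-median_area)]                        # step 3: sort descending

res
```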

Open the file ex4.R

- Use a join to add the month name from 'airmonths' to 'air'

- Use a join to add both the month name and the month abbreviation from 'airmonths' to 'air'

- Use a join to add the month name from 'airmonths' to 'air'; then use chaining to calculate the median Wind speed for each month name

Exercise 4

From wide to long:
irisMelted <- melt(irisDT, id.vars = "Species")

melt(data, id.vars, measure.vars, variable.name = "variable", value.name = "value", na.rm = FALSE, variable.factor = TRUE, value.factor = FALSE)

See also: ?melt

Reshaping data

From long to wide:
dcast(irisMelted, Species ~ variable)

dcast(data, formula, fun.aggregate = NULL, sep = "_", ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var = guess(data))

See also: ?dcast

Reshaping data
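A sketch of the wide-to-long-to-wide round trip on the iris data. Because each species has many values per variable, the cast back to wide has to aggregate. Passing fun.aggregate = mean is a choice made for this example; when no aggregation function is given, dcast falls back to counting the values.

```r
library(data.table)
irisDT <- as.data.table(iris)

# Wide to long: one row per (Species, variable, value)
irisMelted <- melt(irisDT, id.vars = "Species")
nrow(irisMelted)  # 600 = 150 rows x 4 measure columns

# Long to wide: one row per species, one column per variable,
# aggregating the 50 values in each cell with mean
wide <- dcast(irisMelted, Species ~ variable, fun.aggregate = mean)
wide
```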

more joins: non-equi joins + rolling joins

more special symbols: .BY + .GRP

special grouping functions: rowid + rleid

set* functions: setkey + setorder + setcolorder + setnames + …

and even more: frank + shift + CJ + tstrsplit + …

What else is there to discover?
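As a small taste of three of the items listed, the following sketch applies rleid, shift and .GRP to a made-up table; the table and its column names are hypothetical.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b", "a"), v = 1:5)

# rleid: a new id every time the value of g changes
DT[, run := rleid(g)]           # 1 1 2 2 3

# shift: the previous value of v within each group
DT[, prev := shift(v), by = g]  # NA 1 NA 3 2

# .GRP: a counter for the current group
DT[, grp := .GRP, by = g]       # 1 1 2 2 1

DT
```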

Overview of getting started vignettes

Datacamp’s data.table course (paid)

StackOverflow [data.table] tag (>7700 questions)

Want to learn more?

Thank you for your attention!

The End