UBC STAT545 2014 cm009: overview of data aggregation

STAT 545A Class meeting 009: Intro to data aggregation

web companion: STAT 545 web home > Syllabus > cm009

Wednesday, October 1, 2014

Upload: jennifer-bryan

Post on 30-Jun-2015


DESCRIPTION

Lecture slides from UBC STAT545 2014. Not a stand-alone document. http://stat545-ubc.github.io

TRANSCRIPT

Page 3: UBC STAT545 2014 Cm009 overview of data aggregation

Data aggregation


What is data aggregation?

Take some data

Split it up into pieces

Apply a computation to each piece

Combine the results back together again
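The four steps above can be sketched by hand in base R. This is only an illustration on the built-in iris data; the choice of dataset and of the per-piece computation (mean sepal length) is an assumption, not something from the slides.

```r
# Take some data: the built-in iris data frame.
# Split it up into pieces: one data frame per species.
pieces <- split(iris, iris$Species)

# Apply a computation to each piece: mean sepal length within the piece.
means <- lapply(pieces, function(piece) {
  data.frame(Species = piece$Species[1],
             mean_sepal_length = mean(piece$Sepal.Length))
})

# Combine the results back together again: stack the one-row data frames.
result <- do.call(rbind, means)
print(result)
```

Each step is a separate line of bookkeeping here, which is the part that the tools discussed later try to take off your hands.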



The split-apply-combine strategy for data analysis. Hadley Wickham. Journal of Statistical Software, vol. 40, no. 1, pp. 1–29, 2011. http://www.jstatsoft.org/v40/i01/paper

JSS Journal of Statistical Software

April 2011, Volume 40, Issue 1. http://www.jstatsoft.org/

The Split-Apply-Combine Strategy for Data Analysis

Hadley Wickham
Rice University

Abstract

Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This insight gives rise to a new R package that allows you to smoothly apply this strategy, without having to worry about the type of structure in which your data is stored.

The paper includes two case studies showing how these insights make it easier to work with batting records for veteran baseball players and a large 3d array of spatio-temporal ozone measurements.

Keywords: R, apply, split, data analysis.

1. Introduction

What do we do when we analyze data? What are common actions and what are common mistakes? Given the importance of this activity in statistics, there is remarkably little research on how data analysis happens. This paper attempts to remedy a very small part of that lack by describing one common data analysis pattern: Split-apply-combine. You see the split-apply-combine strategy whenever you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This crops up in all stages of an analysis:

• During data preparation, when performing group-wise ranking, standardization, or normalization, or in general when creating new variables that are most easily calculated on a per-group basis.

• When creating summaries for display or analysis, for example, when calculating marginal means, or conditioning a table of counts by dividing out group sums.


It turns out that these things matter:

• how you specify the pieces to split the data into

• how nicely the results are re-combined


Base R has long had the capability to do data aggregation

the “apply” functions

but these functions are not well-harmonized re: how to specify the pieces

nor do they return the results in a highly usable or predictable form
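A small illustration of the "predictable form" complaint, using sapply() on a made-up two-element list. The same function returns a vector, a matrix, or a list depending on what the applied function happens to give back.

```r
x <- list(a = 1:3, b = 4:6)

# Each call returns one number, so sapply() simplifies to a named vector.
r1 <- sapply(x, mean)

# Each call returns two numbers, so sapply() now simplifies to a matrix.
r2 <- sapply(x, range)

# Results of unequal length cannot be simplified, so sapply() returns a list.
r3 <- sapply(x, function(v) v[v > 2])

str(r1)
str(r2)
str(r3)
```

Code that consumes the result has to be ready for all three shapes, which is part of why these functions feel painful in practice.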


Consequence of these shortcomings:

many long-time useRs knew they should be using the “apply” functions

but they did not actually do it

because it’s kind of painful and annoying


I will give you the Big Picture which includes these base R approaches

But I highly recommend the plyr package for most of your data aggregation work

• Better interface, better return values

new arrival dplyr is also relevant, but I don’t think it replaces plyr

http://cran.rstudio.com/web/packages/plyr/

http://cran.rstudio.com/web/packages/dplyr/


How do you want to split your data into pieces?

rows or columns of a matrix or data.frame

groups of observations induced by levels of ≥1 factor(s)

elements of a list

This determines how you will attack data aggregation.


How to do <sthg> for various pieces of a dataset ... using only base R functions

chunks are ... | relevant functions
rows, columns, etc. of matrix or array | apply()
components of a list (remember data.frames are lists and variables are components!) | sapply(), lapply()
groups of observations induced by levels of ≥ 1 factor(s) | aggregate(), tapply(), by(), split() + [sl]apply()
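One call per kind of chunk, as a rough sketch on a toy matrix and the built-in iris data (both chosen here only for illustration). Note how each function has its own way of naming the pieces: a margin, nothing at all, an index vector, a formula.

```r
# Rows/columns of a matrix: apply() with a MARGIN (1 = rows, 2 = columns).
m <- matrix(1:6, nrow = 2)
col_sums <- apply(m, 2, sum)

# Components of a list: the variables of a data frame are list components.
var_means <- sapply(iris[1:4], mean)

# Groups induced by a factor: three spellings of the same computation.
by_tapply    <- tapply(iris$Sepal.Length, iris$Species, mean)
by_aggregate <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
by_split     <- sapply(split(iris$Sepal.Length, iris$Species), mean)

str(by_tapply)     # a 1-d array
str(by_aggregate)  # a data frame
str(by_split)      # a named vector
```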


http://plyr.had.co.nz


Journal of Statistical Software 5

R> models <- dlply(ozonedf, .(lat, long), deseasf_df)

R> deseas <- ldply(models, resid)

dlply takes a data frame and returns a list, and ldply does the opposite: It takes a list and returns a data frame. Compare this code to the code needed when the data was stored in an array.

The following section describes the plyr functions in more detail. If your interest has been whetted by this example, you might want to skip ahead to Section 5.2 to learn more about this example and see some plots of the data before and after removing the seasonal effects.

3. Usage

Table 2 lists the basic set of plyr functions. Each function is named according to the type of input it accepts and the type of output it produces: a = array, d = data frame, l = list, and _ means the output is discarded. The input type determines how the big data structure is broken apart into small pieces, described in Section 3.1; and the output type determines how the pieces are joined back together again, described in Section 3.2.

The effects of the input and outputs types are orthogonal, so instead of having to learn all 12 functions individually, it is sufficient to learn the three types of input and the four types of output. For this reason, we use the notation d*ply for functions with common input, a complete row of Table 2, and *dply for functions with common output, a column of Table 2.

The functions have either two or three main arguments, depending on the type of input:

a*ply(.data, .margins, .fun, ..., .progress = "none")

d*ply(.data, .variables, .fun, ..., .progress = "none")

l*ply(.data, .fun, ..., .progress = "none")

The first argument is the .data which will be split up, processed and recombined. The second argument, .variables or .margins, describes how to split up the input into pieces. The third argument, .fun, is the processing function, and is applied to each piece in turn. All further arguments are passed on to the processing function. If you omit .fun the individual pieces will not be modified, but the entire data structure will be converted from one type to another. The .progress argument controls display of a progress bar, and is described at the end of Section 4.

Note that all arguments start with “.”. This prevents name clashes with the arguments of the processing function, and helps to visually delineate arguments that control the repetition

Input \ Output | Array | Data frame | List | Discarded
Array | aaply | adply | alply | a_ply
Data frame | daply | ddply | dlply | d_ply
List | laply | ldply | llply | l_ply

Table 2: The 12 key functions of plyr. Arrays include matrices and vectors as special cases.
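To make the naming scheme concrete, here is a sketch that walks along the "Data frame" row of Table 2: the input stays a data frame and only the second letter, the output type, changes. It assumes the plyr package is installed; the iris data and the per-piece computation are illustrative choices.

```r
library(plyr)

# ddply: data frame in, data frame out.
df_out <- ddply(iris, ~ Species, summarize, mean_sl = mean(Sepal.Length))

# dlply: data frame in, list out (one element per species).
list_out <- dlply(iris, ~ Species, function(piece) mean(piece$Sepal.Length))

# daply: data frame in, array out (here a 1-d array, i.e. a named vector).
arr_out <- daply(iris, ~ Species, function(piece) mean(piece$Sepal.Length))

str(df_out)
str(list_out)
str(arr_out)
```

The way the pieces are specified (`~ Species`) is identical in all three calls; only the form of the result differs.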


a*ply(.data, .margins, .fun)

• .data: something rectangular

• .margins: 1 ⇒ pieces are rows, 2 ⇒ pieces are columns

• .fun: function to apply to each piece

• the *: what you want back, i.e. a, d, l, nothing
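A minimal sketch of the a*ply pattern with aaply() (array in, array out), assuming plyr is installed. The matrix and its dimnames are made up for the example.

```r
library(plyr)

m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# .margins = 1: each piece is a row.
row_sums <- aaply(m, 1, sum)

# .margins = 2: each piece is a column.
col_sums <- aaply(m, 2, sum)

print(row_sums)
print(col_sums)
```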


d*ply(.data, .variables, .fun)

• .data: data.frame

• .variables: split by levels of these factor(s)

• .fun: function to apply to each piece

• the *: what you want back, i.e. a, d, l, nothing
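A sketch of the d*ply pattern when the pieces come from more than one variable, assuming plyr is installed. The built-in mtcars data and the summary computed here are illustrative choices.

```r
library(plyr)

# Split mtcars by every observed combination of cyl and am, then
# return one summary row per piece; ddply() glues the rows together
# and adds the splitting variables as leading columns.
mpg_by_group <- ddply(mtcars, .(cyl, am), function(piece) {
  data.frame(n = nrow(piece), mean_mpg = mean(piece$mpg))
})

print(mpg_by_group)
```

The `.()` helper quotes the variable names; a formula such as `~ cyl + am` is another accepted way to specify the same split.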


l*ply(.data, .fun)

• .data: list

• .fun: function to apply to each element

• the *: what you want back, i.e. a, d, l, nothing
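A sketch of the l*ply pattern, assuming plyr is installed. The input list is made up; the three calls differ only in the form of what comes back.

```r
library(plyr)

x <- list(a = 1:3, b = 4:6)

# llply: list in, list out (comparable to lapply()).
as_list <- llply(x, mean)

# laply: list in, array out (here a plain vector).
as_array <- laply(x, mean)

# ldply: list in, data frame out; the list names land in an .id column.
as_df <- ldply(x, function(v) data.frame(mean = mean(v), n = length(v)))

print(as_df)
```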


ddply (data frame in, data frame out): the most useful one!


ddply(.data, .variables, .fun = NULL)

• .data: Take this data.frame ...

• .variables: divide it into pieces, i.e. smaller data.frames, based on this factor and ...

• .fun: apply this function to each piece ...

• ... glue the results back together and return as a data.frame


ddply(gDat, ~ country, le_lin_fit)

• gDat: Take this data.frame ...

• ~ country: divide it into pieces, i.e. smaller data.frames, based on this factor and ...

• le_lin_fit: apply this function to each chunk ...

• ... glue the results back together and return as a data.frame

We wrote the le_lin_fit() function to regress life expectancy on time. This is how we apply that function to the data for each country in the Gapminder data.
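The slides do not show the body of le_lin_fit(), so the version below is a hypothetical stand-in: it regresses life expectancy on year and returns the two coefficients. The tiny gDat here is made-up data with the column names country, year and lifeExp assumed, not the real Gapminder data. plyr is assumed to be installed.

```r
library(plyr)

# Made-up stand-in for the Gapminder data: two countries, exact linear trends.
gDat <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year    = rep(c(1952, 1957, 1962, 1967), times = 2),
  lifeExp = c(40, 41, 42, 43, 60, 61.5, 63, 64.5)
)

# Hypothetical le_lin_fit(): regress lifeExp on year, shifted so that the
# intercept is the fitted life expectancy in 1952, and return the coefficients.
le_lin_fit <- function(dat, offset = 1952) {
  fit <- lm(lifeExp ~ I(year - offset), data = dat)
  setNames(coef(fit), c("intercept", "slope"))
}

# One row per country: the country label plus the two coefficients.
fits <- ddply(gDat, ~ country, le_lin_fit)
print(fits)
```

Because the function only ever sees one country's rows, it can be developed and tested on a single piece before being handed to ddply().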


there are dplyr-ish ways to do what plyr::ddply() and plyr::dlply() do:

• specify groups via group_by()

• try to achieve your goals with summarize(...) and/or window functions

• if not possible ... use do()

So far, I still think plyr::d[dl]ply() is more natural than dplyr + group_by() + do()

Not sure if this is legit or I am just resistant to change
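For comparison, a rough sketch of the dplyr route described above, assuming dplyr is installed. The iris data and the quantities computed are illustrative choices. In dplyr releases later than these slides, do() is marked as superseded, though it still works.

```r
library(dplyr)

# Simple per-group summaries: group_by() + summarize().
means <- iris %>%
  group_by(Species) %>%
  summarize(mean_sl = mean(Sepal.Length))

# When the per-group computation does not fit summarize(), fall back to do().
# Inside do(), `.` is the data for the current group, and the expression
# must return a data frame.
fits <- iris %>%
  group_by(Species) %>%
  do(data.frame(
    slope = unname(coef(lm(Sepal.Length ~ Sepal.Width, data = .))[2])
  ))

print(means)
print(fits)
```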