Does Pivot Tables and More
Jim Holtman [email protected]
There were several papers at CMG2008, and previous conferences, that got me thinking about other ways that R can help with the analysis and visualization of performance data. There were a couple of sessions that made use of pivot tables in Excel to help analyze data. There was also a paper that referenced “sparklines” as a method of visualizing data. This paper will show how R can be used to do these, and other, procedures that will enhance your ability to better analyze performance data.
1 Overview
At CMG many of the papers describe how various performance metrics about a system can be analyzed. There are a number of different ways that this data is collected (proprietary vendor code, open source, user written scripts, etc.). Once this data is collected, there are again a variety of vendor, open source and user written procedures to process this information. Many of these are very flexible in providing a user with ways of customizing the subset of data to be analyzed, the algorithms to analyze the data and the format for the presentation of this data.
I have used many of these tools in the past and still rely on them. Like most practitioners of computer performance analysis, I have my own tool chest of things that make my life easier. These include Perl for pre-processing/formatting unstructured data from log files, standard text editors for examining/changing data, Excel for quick looks at the data and for communicating results to others who are used to working with Excel, and of course R, which is my favorite because of its versatility for analysis and graphical presentation of the results.
R is an open source language and environment for statistical processing. It is based on the S language originally developed at Bell Labs by John Chambers, who won the ACM Software System Award in 1998 for the language. It easily handles data files with millions of records (e.g., transaction response times), and can compute, for example, the average response time and create a histogram of the response times in less than a couple of seconds.
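As a hedged illustration of that scale, the sketch below uses a million simulated response times rather than a real transaction log; the numbers and distribution are invented for demonstration only.

```r
# A million simulated transaction response times (seconds); exponential
# times are a stand-in for a real log, chosen only for illustration.
rt <- rexp(1e6, rate = 2)

# Computing the average over the full vector is near-instantaneous.
avg <- mean(rt)

# hist() bins and draws the distribution of all one million values.
hist(rt, breaks = 100,
     main = "Response Time Distribution", xlab = "seconds")
```

On current hardware both the mean and the histogram complete well under a second, which is the point the paragraph above makes.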
The graphics available in R for data visualization are very rich and flexible. Being able to slice/dice your data and then visualize it in various ways allows you to quickly see patterns in your data that just numbers in a table will not reveal.
It is very well supported through an active users' group, and there are over 85 books available covering the areas that R has been used for. I have used it for the last 25 years for doing computer performance analysis.

To quickly find R on the internet, just type “R” into Google and it will be the first hit. The links will provide an overview of R. There is a learning curve to it, but it is well worth the effort if you are serious about performance analysis. The presentation slides have a “10 minute R workshop” which provides an overview of R.
2 Pivot Tables
John Van Wagenen's paper “Pivot Tables/Charts – Magic Beans Without Living in a Fairy Tale” at CMG 2008 gave a very good overview of how pivot tables can help in analyzing, and visualizing, data that a performance analyst typically works with.
Pivot tables allow an analyst to slice/dice the data in various ways, and to create aggregations of the data by various classifications. Pivot tables are typically associated with Excel, but the same information can be constructed by a variety of packages. For example, SQL statements can be used to “group” the data by various criteria and then summarize the results. Most of the vendor supplied packages have similar capabilities.
John gave his permission to use the data from his paper so that I can illustrate that the results are similar when using R. The spreadsheet that he shared with me had some different data, but it did have the pivot tables generated from this data.

Figure 1 - Sample 15 Minute Data From Excel
2.1 Pivot Tables
The first example is from 15 minute data that was collected on system utilization. Figure 1 is a sample of the first entries in the Excel spreadsheet. To read this data into R, I converted the spreadsheet to a CSV file. R can directly read from Excel spreadsheets, but it is easier to illustrate the processing if we assume the data is in a file, since that is probably where most data is located. The resulting CSV is shown in Figure 2.

In Excel a pivot table was created summarizing the CPU_HOUR over each DAY, HOUR and MIN, and generating totals on each of the breaks. The Excel pivot table is shown in Figure 3. You can read John's paper to see how to set up the pivot table from the given input.
To create a similar output in R, the script is shown in Figure 4. The first statement ('read.csv') calls a function that will read in a CSV (comma separated values) file. The default parameters are that the separator is a comma and that there is a header line in the file that defines the names of the columns when the data is read in. If your data file does not have a header line, then the parameter 'header=FALSE' tells the function to start reading the data at line 1; you can then assign names to the columns as you desire. If you have another separator, like a tab or semicolon, these can be specified.
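A minimal sketch of this read step follows; to keep it self-contained it first writes a two-row sample of the Figure 2 data to a temporary file (in practice you would point read.csv at your own CSV file).

```r
# A tiny, illustrative subset of the CSV shown in Figure 2.
csv <- c(
  "DAY,HOUR,MIN,SEC,MACHINE,LPAR,PHY_TOT,MIPS,CPS,CPU_HOUR,TYPE",
  "6/2/2008,0,30,1,713,QA,0.63,34.13,13,0.0819,TEST",
  "6/2/2008,0,30,1,713,QB,1.32,71.50,13,0.1716,TEST")
f <- tempfile(fileext = ".csv")
writeLines(csv, f)

# Defaults: comma separator, first line is a header naming the columns.
cpu.15 <- read.csv(f)

# With no header line, suppress it and assign names yourself:
#   cpu.15 <- read.csv(f, header = FALSE)
#   names(cpu.15) <- c("DAY", "HOUR", "MIN", ...)
# Other separators are given explicitly, e.g. sep = "\t" or sep = ";".
```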
The data is read into an object ('cpu.15') which is a 'dataframe'. In R, a dataframe is very similar to an Excel spreadsheet in that it looks like a table where each of the columns can have a different attribute (e.g., character, numeric, etc.) and it is easy to reference the data items individually or as a vector representing the entire column. Part of the power of R comes from the 'vectorized' operations that make it easy to define transformations on the data. The contents of the dataframe are shown in Figure 5; notice that it looks very similar to the Excel spreadsheet in Figure 1.
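A short sketch of what "vectorized" means in practice, using invented values that mirror the Figure 2 sample (the 0.13 factor is chosen so the derived numbers match that sample):

```r
# A dataframe behaves like a typed spreadsheet: each column has its own
# class, and whole-column operations need no explicit loops.
df <- data.frame(LPAR    = c("QA", "QB", "QD"),
                 PHY_TOT = c(0.63, 1.32, 0.44))

# Derive a new column in one vectorized expression.
df$CPU_HOUR <- df$PHY_TOT * 0.13

df$PHY_TOT[2]   # reference a single item
df$PHY_TOT      # or the entire column as a vector
```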
As in any programming environment, there are a number of ways of getting similar results. In R there are a number of functions (apply, aggregate, tapply, etc.) that can summarize data in a pivot table-like format. R also has a number of “packages” (similar to modules in Perl, classes in Java, or libraries in C/C++) which encapsulate useful functions that minimize the amount of code that has to be written.
R has a number of these packages that make it easy to “transform” data, aggregate the data and then summarize the results. One of the packages that I have found very useful is the “reshape” package, which lets you restructure and aggregate your data with just two functions: 'melt' and 'cast'. 'melt' puts the data into a format that can be used by 'cast' to then create new aggregations of the data. Documentation is provided with the package that provides plenty of examples of how to use it.

Figure 4 - R Commands to Create the Pivot Table

Figure 3 - Pivot Table From Excel

DAY,HOUR,MIN,SEC,MACHINE,LPAR,PHY_TOT,MIPS,CPS,CPU_HOUR,TYPE
6/2/2008,0,30,1,713,*PHYSI,3.26,176.5907684,13,0.4238,PROD
6/2/2008,0,30,1,713,AAMTBC,0,0,13,0,TEST
6/2/2008,0,30,1,713,BBMTBC,0,0,13,0,TEST
6/2/2008,0,30,1,713,GGMTBC,0,0,13,0,TEST
6/2/2008,0,30,1,713,QA,0.63,34.12643684,13,0.0819,TEST
6/2/2008,0,30,1,713,QB,1.32,71.50301053,13,0.1716,TEST
6/2/2008,0,30,1,713,QD,0.44,23.83433684,13,0.0572,TEST
6/2/2008,0,30,1,713,SOLAR1,0.33,17.87575263,13,0.0429,PROD

Figure 2 - CSV File for Input to R
So in the script, I indicate that I want to use the package [require(reshape)], and then I 'melt' the dataframe that was read in, specifying that I intend to use three of the columns (DAY, HOUR, MIN) to aggregate the data and that the value I want to aggregate is CPU_HOUR.
Now that the data has been 'melt'ed, it can be 'cast' into some output. The cast function has as its first parameter the object (cpu.melt) from the 'melt', and then a formula specifying how the data is to be aggregated. The formula 'DAY + HOUR ~ MIN' indicates that the rows will have DAY and HOUR, and that the columns will contain the MIN. The data will be aggregated with these variables, and the 'sum' will be computed and stored in the resulting dataframe. There is also a parameter to indicate that “margins” are to be created. Margins will produce row totals and column totals on the control breaks, which in this case is DAY. The output of the first 25 lines is shown in Figure 4.
Comparing this output with Figure 3 shows the results are the same; the layout of the data is different. The last command just creates a pivot table summarizing the CPU_HOUR per day. The data file had over 10,000 lines of data. It took 1 second to read the data in and create the two “pivot” table outputs. The script can be reused to read in any number of data files.
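A hedged sketch of the pivot just described: the commented 'melt'/'cast' calls are my reconstruction of the paper's script (the exact arguments are not shown in the text), and the runnable portion uses base R's xtabs(), one of the built-in alternatives mentioned above, with a small invented dataframe.

```r
# Reconstruction of the reshape-package calls described in the text:
#   require(reshape)
#   cpu.melt <- melt(cpu.15, id = c("DAY", "HOUR", "MIN"),
#                    measure = "CPU_HOUR")
#   cast(cpu.melt, DAY + HOUR ~ MIN, sum, margins = TRUE)

# An equivalent cross-tabulation in base R, on invented sample values:
cpu.15 <- data.frame(DAY      = "6/2/2008",
                     HOUR     = c(0, 0, 1, 1),
                     MIN      = c(15, 30, 15, 30),
                     CPU_HOUR = c(0.42, 0.08, 0.17, 0.06))
cpu.15$DH <- paste(cpu.15$DAY, cpu.15$HOUR)   # DAY+HOUR row key

# Rows are DAY+HOUR, columns are MIN, cells are summed CPU_HOUR.
pivot <- xtabs(CPU_HOUR ~ DH + MIN, data = cpu.15)
addmargins(pivot)    # row/column totals, like the pivot-table margins
```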
2.2 Pivot Charts
Another use of the output from a pivot table is to generate a chart. John had a data file about batch jobs being run. A sample of the contents of the Excel spreadsheet is shown in Figure 6. This data was summarized by shift, and the pie chart in Figure 7 was created; the pivot table for this chart is Figure 8.
Figure 9 shows the R script used to read in the CSV file created from the Excel spreadsheet, summarize the CPU hours by shift and then create the pie chart in Figure 11. This used another R function (tapply) to create the aggregation by shift. As I mentioned previously, there are a number of ways of doing things in R. I did notice one difference in the data in that John's pivot table filtered out HOLIDAY since there was such a small usage. I chose to leave it in, but could have easily removed it from the data. This file had about 24,000 data lines. It took 0.5 seconds to read in the data, aggregate the data and generate the pie chart.
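The shift summary can be sketched as follows; the shift names match the chart, but the CPU-hour values are invented stand-ins for the 24,000-line file.

```r
# tapply() groups the CPU_HOUR values by SHIFT and sums each group;
# pie() then draws the breakdown. Values are illustrative only.
batch <- data.frame(
  SHIFT    = c("PRIME", "PRIME", "PERIOD2", "PERIOD3", "WEEKEND", "HOLIDAY"),
  CPU_HOUR = c(10.2, 8.1, 4.4, 3.0, 2.2, 0.1))

by.shift <- tapply(batch$CPU_HOUR, batch$SHIFT, sum)
pie(by.shift, main = "Breakdown by Shifts")
```

Dropping HOLIDAY, as John's pivot table did, would be one line: `batch <- subset(batch, SHIFT != "HOLIDAY")` before the tapply.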
Figure 5 - Dataframe in R Created from the CSV File (Looks Like Excel Spreadsheet)

Figure 6 - Batch Data From Excel

Figure 7 - Pie Chart of Shift Usage

Figure 8 - Pivot Table of Shift Usage

The final example makes use of some implied information in the data. In the spreadsheet, the column DB2 had a name such that if the 3rd character was a “P”, then it was production (PROD); otherwise it was development (DEV). So when the data was read in, a new column was added with this indication so the pivot table could be generated.
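A hedged reconstruction of that derived column (the DB2 names below are invented; only the "3rd character is P" rule comes from the text):

```r
# If the 3rd character of the DB2 name is "P" the row is production,
# otherwise development. substr() and ifelse() are both vectorized,
# so one expression classifies every row.
jobs <- data.frame(DB2 = c("DBP01", "DBD02", "DBPXX"))
jobs$WORKLOAD <- ifelse(substr(jobs$DB2, 3, 3) == "P", "PROD", "DEV")
jobs$WORKLOAD
```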
Figure 10, Figure 12 and Figure 13 are the data in the Excel spreadsheet and the pivot table and chart created from the data. Figure 14 is the R script to read in the data, create a new column with the workload, create the pivot table and then generate the chart in Figure 15. This data only had 96 rows, and it took 0.2 seconds to read in the data, do the transformations, generate the pivot table and the chart.
3 Sparklines
In Ron Kaminski's paper on “Automating Process Pathology Detection – Rule Engine Design Hints” he described “sparklines” as one of the ways of presenting a lot of data in a small amount of space. Basically, sparklines are graphs without the axes to clutter up the presentation of information. Sparklines were invented by Edward Tufte, who is a well known expert on data visualization.
Figure 16 is an example of sparklines showing the price of 4 stocks over a 5 year period. You can see that they have roughly the same shape, even though the y-axis has different ranges. Numbers provide the extent of these ranges and identify other important points.
I have used multiple graphs on a page to show the relationships between various measurements, but typically I was limited to displaying around 15 charts with all the extra space being taken up
Figure 9 - R Script for Shift Usage
Figure 11 - Pie Chart from R (Breakdown by Shifts: HOLIDAY, PERIOD2, PERIOD3, PRIME, WEEKEND)
Figure 10 - Excel Data for Prod/Dev Pivot Table
Figure 12 - Excel Pivot Table from Data
Figure 13 - Chart from Excel Pivot Table
[Chart axes: cpu seconds, 0 – 2,500,000; months 5/1/2007 – 6/1/2008; series DEV and PROD]
by the labeling of the axes.
Figure 25 is just to show the amount of space that is taken up with labeling the axes and such. It also makes it hard to compare different graphs to look for patterns.
With R it is easy to generate sparklines because you have complete control over how graphics are created. R has some very sophisticated graphics, but I will use just the basic graphics to show how sparklines can be created.
The only difference between creating a set of charts like Figure 25 and sparklines is telling the system not to create the axes and to plot the data in a smaller window. The charts in Figure 25 were created from running the 'vmstat' command on a UNIX system. Vmstat will record about 20 different measurements including CPU utilization, memory and number of running processes. Similar data will be used to demonstrate sparklines.
One of the scripts that I have running on systems that I monitor writes the vmstat data to a file with a timestamp. This data is then read by the analysis programs, and reports and charts are created. An example of the log file is shown in Figure 17.
This data is read in and results in a matrix with each row being a sample and the columns the data for that sample. Figure 18 shows the amount of R code that was written to create a plot of sparklines with 'nr' rows and 'nc' columns on a single page. Figure 26 is the sparklines that were generated.
This represents one day of system operation (00:00 – 24:00). On the left side of each sparkline is the name of the measurement being plotted. This is followed by its average value over the day. The average value is represented by the horizontal gray line that can be used as a reference as to the variation of the data. The red number on the left above the gray line is the maximum value; the green number on the right below the gray line is the minimum value for the day. This allows you to quickly see some of the relationships. There is also a red dot to mark the first maximum and a green dot to mark the first minimum of the sample.
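The approach can be sketched as below. This is my own minimal reconstruction, not the paper's Figure 18 script: small panels, no axes, a gray mean line, and red/green dots at the first maximum and minimum; the annotation numbers and sample data are simplified inventions.

```r
# Plot each column of matrix m as a sparkline in an nr-by-nc grid:
# no axes, a gray reference line at the mean, and dots at the first
# maximum (red) and first minimum (green).
sparklines <- function(m, nr, nc) {
  op <- par(mfrow = c(nr, nc), mar = c(0.5, 0.5, 0.5, 0.5))
  on.exit(par(op))
  for (j in seq_len(ncol(m))) {
    x <- m[, j]
    plot(x, type = "l", axes = FALSE, xlab = "", ylab = "")
    abline(h = mean(x), col = "gray")                 # reference line
    points(which.max(x), max(x), col = "red",   pch = 16)
    points(which.min(x), min(x), col = "green", pch = 16)
  }
}

# Invented demo: 96 samples where "idle" mirrors "busy", as with the
# idle and user+system measurements discussed below.
m <- cbind(busy = 40 + sin(1:96 / 8) * 20,
           idle = 60 - sin(1:96 / 8) * 20)
sparklines(m, nr = 2, nc = 1)
```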
The easiest one to point out is the last two lines on the chart: the idle time and the 'user + system' time. As you can see, these are mirror images of each other, and this is what you would expect from the data.
Even without the time being explicit, since we know that this represents a 24-hour day, we can see that the first third of the day appears to be the busiest, with the overall activity in the rest of the day being low. For this system, that is what happens; it processes the performance data from a number of systems by downloading log files and then processing the data so that it is
Figure 14 - R Script to Create Pivot Table and Chart
Figure 15 - Chart Generated from R
[Chart axes: Total CPU Seconds, 0 – 2,000,000; months 2007-05-01 – 2008-06-01; series PROD and DEV]
Figure 16 - Example of Sparklines
Figure 17 - Example of vmstat Log File
ready by 07:00 for review to see how the system performed the previous day.
Figure 27 was from a CMG2004 paper I wrote and is a “levelplot” of the system utilization for a month. It uses color to show what would be the z-axis value (utilization) if this were a 3D graph. The data used to create the sparklines is from 5/16/05, so you should be able to compare the utilization (user + sys) of the sparkline with the levelplot.
I also added to the plot the set of sparklines for the same period. Do they both convey the same information to you?
In Figure 28 I just took the month's worth of sparklines and replicated them 12 times to show what a year's worth of utilization might look like. Wouldn't it be nice to have a page like this for each of your systems so that you could look for patterns? You could also line up the plot so that a day of the week was a row, so you could see the pattern for that day in the month across the year.
If you really like 3D plots, R can generate those also. The 'rgl' package will create a 3D plot that you can rotate with a mouse to see different views. Figure 29, Figure 30 and Figure 31 show the interactive 3D graphs that can be created with R.
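The rgl package gives interactive, mouse-rotatable views; as a hedged, static base-R analogue, persp() renders one fixed viewpoint of a utilization surface. The surface below (days by hours) is invented, not the paper's data.

```r
# Invented day-by-hour utilization surface: a daily cycle that peaks
# mid-day, just to give persp() something shaped like real data.
util <- outer(1:31, 0:23,
              function(d, h) 40 + 30 * sin(h / 24 * 2 * pi))

# One fixed 3D view; theta/phi pick the viewing angle that rgl would
# instead let you choose interactively with the mouse.
persp(x = 1:31, y = 0:23, z = util,
      xlab = "Day", ylab = "Hour", zlab = "Utilization",
      theta = 30, phi = 25, col = "lightblue")
```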
4 Transaction Data
I want to use some transaction data to show another way of visualizing the data from a pivot table. I originally had a transaction log of 79,000 transactions; 159 transaction types across 300 users. To make the data easier to present, I created 10 transaction types by splitting the transactions based on their response times (Trans.01 has the shortest average response time and
Figure 18 - R Function to Plot Each Column as a Sparkline With 'nr' Rows and 'nc' Columns
Figure 19 - Transaction Count for User/Tran
Figure 20 - Stacked Bar Chart of the Transaction Count
[Stacked bar chart: Transaction Count by User; y-axis Total Transactions, 0 – 15,000; x-axis User.01 – User.10; stacked series Tran.01 – Tran.10]
Trans.10 has the longest). The users were just split into 10 groups randomly.
The log file has the user, transaction, start and end times. The file was read in with an R script, and the pivot table in Figure 19 was created. If you look at the data, User.06 has the smallest transaction count and User.08 the largest.
One way of visualizing this information is using a stacked bar chart as shown in Figure 20. Here it is easy to see that User.08 entered the most transactions and User.06 the least. But it is hard to determine, for each user, the ratios between the individual transactions for that user. This is where a “mosaic” plot helps to visualize this relationship. In a mosaic plot, the values are plotted as rectangles and the area of the rectangle is proportional to the count. The vertical axis will be the same for all variables so that you can see the relationships of the transaction counts for a user.
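A hedged sketch of such a mosaic plot: a user-by-transaction count table is built with xtabs(), and base R's mosaicplot() sizes each rectangle by count. The nine log entries are invented.

```r
# Invented transaction log: User.01 entered 6 transactions, User.02
# entered 3, spread unevenly across three transaction types.
txn <- data.frame(
  USER = rep(c("User.01", "User.02"), times = c(6, 3)),
  TRAN = c("Tran.01", "Tran.01", "Tran.02", "Tran.02", "Tran.02", "Tran.03",
           "Tran.01", "Tran.03", "Tran.03"))

# Cross-tabulate counts, then draw rectangles with area ~ count.
counts <- xtabs(~ USER + TRAN, data = txn)
mosaicplot(counts, main = "Transactions by User", color = TRUE)
```

In the resulting plot the User.01 column is twice as wide as User.02, and within each column the rectangles split by that user's transaction mix, which is exactly the per-user ratio comparison the paragraph above describes.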
Figure 21 is the mosaic plot of the pivot table data. You can see on the chart that User.08 has the widest vertical area, indicating that this user has the highest total transaction count; User.06 has the least area, indicating the lowest transaction count.
In this view of the data, you can see that User.06 has a higher percentage of transactions Tran.06, Tran.09 and Tran.10 than User.08. This might indicate that these two users have different roles and therefore execute different transaction mixes. A mosaic plot can help identify this condition.
Figure 24 shows the relationship of the ratios of the average response times of transactions for each user. Here you can see that Trans.10 appears to have an average response time that is almost equal to the sum of the response times for the other 9 transactions. Again, based on how I partitioned the transactions, Trans.10 should have the longest response time, but even across some of the users, there is quite a bit of variation. Remember that this chart does not show the value of the average response time of a transaction for a user, just the ratio of its response time compared to the other transactions executed by that user. Figure 22 shows the average transaction response time for each user.
Even though Trans.10 has the longest response time, it is executed less frequently than most of the other transactions, as you can see in Figure 21.
Figure 23 is a graph of the sparklines of the distribution of the average response times of the transactions for a given user. Think of this as a histogram drawn with a smooth line. The x-axis is 0-3 seconds for the response times. In the data, there was a maximum of 879 seconds for one transaction (I am not sure if the user really waited for a response in this case); the 95th percentile was 1.7 seconds, so I chose 3 seconds for the chart since this encompassed over 95% of all the transactions. In most cases, SLAs (service level agreements) are based on XX% of the response times being less than a given number. Systems I have worked on in the past had this number as 90%/95%.
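One such density sparkline can be sketched as follows; density() gives the "histogram drawn with a smooth line", and quantile() gives the 95th percentile used above to pick the 0-3 second axis range. The response times are simulated, not the paper's log.

```r
# Simulated response times standing in for one user's transactions.
rt <- rexp(5000, rate = 2)

# The 95th percentile guides the choice of x-axis range, as in the text.
p95 <- quantile(rt, 0.95)

# Smooth density over 0-3 seconds, drawn sparkline-style with no axes.
d <- density(rt, from = 0, to = 3)
plot(d, axes = FALSE, main = "", xlab = "", ylab = "")
```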
Looking at the data, it appears that User.06 and User.07 have larger tails on the right side, indicating that they are experiencing longer average response times. The pivot table in Figure 22 shows that these
Figure 21 - Mosaic Plot of Transaction Counts for a User
[Mosaic Plot of the Number of Transactions by User - Area Proportional to Count; x-axis User.01 – User.10, y-axis Trans.01 – Trans.10]
Figure 22 - Average Response Time of Transactions for Each User
users do have the longest average response times across all their transactions. These users might have different roles, and therefore execute a different mix of transactions, some of which have longer response times. It is this type of analysis that leads to a better understanding of your environment.
5 Wrap-Up
Hopefully I have given you some examples of other things that R can do, and hopefully they will whet your appetite to learn more about R.

R should be considered as one of the tools that you have in your toolkit. In my current engagement, I use R for most of the analysis that I do, but still make extensive use of Excel. Excel happens to be the preferred way of interchanging data among the other people on the projects. They will give me data in an Excel spreadsheet that I can use as input. When I generate output, in many cases I will transfer the results to an Excel spreadsheet (R can write Excel workbooks with multiple sheets) since it allows the recipient to do further manipulations of the data, or to include the data in Word documents or PowerPoint presentations.
The R scripts, and data, used in this paper are available if you send me an email requesting them.
6 References
[1] J. Van Wagenen, “Pivot Tables/Charts – Magic Beans Without Living in a Fairy Tale”, CMG 2008
[2] R. Kaminski, “Automating Process Pathology Detection – Rule Engine Design Hints”, CMG 2008
[3] R Development Core Team, “R: A Language and Environment for Statistical Computing”, ISBN 3-900051-07-0, http://www.R-project.org
[4] J. Holtman, “Using R for System Performance Analysis”, CMG 2004
[5] J. Holtman, “Visualization Techniques for Analyzing Patterns in System Performance Data”, CMG 2005
[6] N. J. Gunther, “Guerrilla Capacity Planning”, Springer-Verlag, Heidelberg, Germany, 2007
[7] H. Wickham, “Reshaping Data with the reshape Package”, Journal of Statistical Software, 21(12), 2007
[8] W. N. Venables and B. D. Ripley, “Modern Applied Statistics with S”, Fourth Edition, Springer, 2002, ISBN 0-387-95458-0
[9] E. Tufte, “Beautiful Evidence”, Graphics Press, 2006
[10] P. Spector, “Data Manipulation with R”, Springer (Use R! series), 2009, ISBN 978-0387747309
Figure 24 – Ratios of Average Response Time of Transactions for a User
[Mosaic Plot of Response Times - Area Proportional to Time; x-axis User.01 – User.10, y-axis Trans.01 – Trans.10]
Figure 23 - Sparklines of the Density (Histogram) Plot of Response Times for a User
Response Time Distribution -- Sparklines
[Sparkline panels: User.01 – User.10]
Figure 25 - Typical Multiplots Per Page - Data from 5/16/05
Figure 26 - Sparklines Created from 'vmstat' Log File: 19 Different Measurements for 5/16/05 (red is max; green is min)
Figure 27 - Levelplot (3D on 2D Surface) of System Utilization for a Month + Equivalent Sparklines
Figure 28 - What One Year of System Utilization Might Look Like in Sparklines
Figure 29 - 3D Chart of the Utilization Data
Figure 30 - Another View of the Same Data
Figure 31 - Yet Another View from Underneath