Does Pivot Tables and More
Jim Holtman [email protected]
There were several papers at CMG2008, and previous conferences, that got me thinking about other ways that R can help with the analysis and visualization of performance data. There were a couple of sessions that made use of pivot tables in Excel to help analyze data. There was also a paper that referenced “sparklines” as a method of visualizing data. This paper will show how R can be used to do these, and other, procedures that will enhance your ability to better analyze performance data.
1 Overview
At CMG many of the papers describe how various performance metrics about a system can be analyzed. There are a number of different ways that this data is collected (proprietary vendor code, open source, user written scripts, etc.). Once this data is collected, there are again a variety of vendor, open source and user written procedures to process this information. Many of these are very flexible in providing a user with ways of customizing the subset of data to be analyzed, the algorithms to analyze the data and the format for the presentation of this data.
I have used many of these tools in the past and still rely on them. Like most practitioners of computer performance analysis, I have my own tool chest of things that make my life easier. These include Perl for pre-processing/formatting unstructured data from log files, standard text editors for examining/changing data, Excel for quick looks at the data and for communicating results to others who are used to working with Excel, and of course R, which is my favorite because of its versatility for analysis and graphical presentation of the results.
R is an open source language and environment for statistical processing. It is based on the S language originally developed at Bell Labs by John Chambers, who won the ACM Software System Award in 1998 for the language. It easily handles data files with millions of records (e.g., transaction response times), and can compute, for example, the average response time and create a histogram of the response times in less than a couple of seconds.
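As a hedged illustration of that scale, the sketch below uses a million simulated response times rather than a real transaction log; the numbers and distribution are invented for demonstration only.

```r
# A million simulated transaction response times (seconds); exponential
# times are a stand-in for a real log, chosen only for illustration.
rt <- rexp(1e6, rate = 2)

# Computing the average over the full vector is near-instantaneous.
avg <- mean(rt)

# hist() bins and draws the distribution of all one million values.
hist(rt, breaks = 100,
     main = "Response Time Distribution", xlab = "seconds")
```

On current hardware both the mean and the histogram complete well under a second, which is the point the paragraph above makes.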
The graphics available in R for data visualization are very rich and flexible. Being able to slice/dice your data and then visualize it in various ways allows you to quickly see patterns in your data that just numbers in a table will not reveal.
It is very well supported through an active users' group, and there are over 85 books available covering the areas that R has been used for. I have used it for the last 25 years for doing computer performance analysis.

To quickly find R on the internet, just type “R” into Google and it will be the first hit. The links will provide an overview of R. There is a learning curve to it, but it is well worth the effort if you are serious about performance analysis. The presentation slides have a “10 minute R workshop” which provides an overview of R.
2 Pivot Tables
John Van Wagenen's paper “Pivot Tables/Charts – Magic Beans Without Living in a Fairy Tale” at CMG 2008 gave a very good overview of how pivot tables can help in analyzing, and visualizing, data that a performance analyst typically works with.
Pivot tables allow an analyst to slice/dice the data in various ways, and to create aggregations of the data by various classifications. Pivot tables are typically associated with Excel, but the same information can be constructed by a variety of packages. For example, SQL statements can be used to “group” the data by various criteria and then summarize the results. Most of the vendor supplied packages have similar capabilities.
John gave his permission to use the data from his paper so that I can illustrate that the results are similar when using R. The spreadsheet that he shared with me had some different data, but it did have the pivot tables generated from this data.

Figure 1 - Sample 15 Minute Data From Excel
2.1 Pivot Tables
The first example is from 15 minute data that was collected on system utilization. Figure 1 is a sample of the first entries in the Excel spreadsheet. To read this data into R, I converted the spreadsheet to a CSV file. R can directly read from Excel spreadsheets, but it is easier to illustrate the processing if we assume the data is in a file, since that is probably where most data is located. The resulting CSV is shown in Figure 2.

In Excel a pivot table was created summarizing the CPU_HOUR over each DAY, HOUR and MIN, and generating totals on each of the breaks. The Excel pivot table is shown in Figure 3. You can read John's paper to see how to set up the pivot table from the given input.
To create a similar output in R, the script is shown in Figure 4. The first statement ('read.csv') calls a function that will read in a CSV (comma separated values) file. The default parameters are that the separator is a comma and that there is a header line in the file that defines the names of the columns when the data is read in. If your data file does not have a header line, then the parameter 'header=FALSE' tells the function to start reading the data at line 1; you can then assign names to the columns as you desire. If you have another separator, like a tab or semicolon, these can be specified.
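A minimal sketch of this read step follows; to keep it self-contained it first writes a two-row sample of the Figure 2 data to a temporary file (in practice you would point read.csv at your own CSV file).

```r
# A tiny, illustrative subset of the CSV shown in Figure 2.
csv <- c(
  "DAY,HOUR,MIN,SEC,MACHINE,LPAR,PHY_TOT,MIPS,CPS,CPU_HOUR,TYPE",
  "6/2/2008,0,30,1,713,QA,0.63,34.13,13,0.0819,TEST",
  "6/2/2008,0,30,1,713,QB,1.32,71.50,13,0.1716,TEST")
f <- tempfile(fileext = ".csv")
writeLines(csv, f)

# Defaults: comma separator, first line is a header naming the columns.
cpu.15 <- read.csv(f)

# With no header line, suppress it and assign names yourself:
#   cpu.15 <- read.csv(f, header = FALSE)
#   names(cpu.15) <- c("DAY", "HOUR", "MIN", ...)
# Other separators are given explicitly, e.g. sep = "\t" or sep = ";".
```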
The data is read into an object ('cpu.15') which is a 'dataframe'. In R, a dataframe is very similar to an Excel spreadsheet in that it looks like a table where each of the columns can have a different attribute (e.g., character, numeric, etc.) and it is easy to reference the data items individually or as a vector representing the entire column. Part of the power of R comes from the 'vectorized' operations that make it easy to define transformations on the data. The contents of the dataframe are shown in Figure 5; notice that it looks very similar to the Excel spreadsheet in Figure 1.
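A short sketch of what "vectorized" means in practice, using invented values that mirror the Figure 2 sample (the 0.13 factor is chosen so the derived numbers match that sample):

```r
# A dataframe behaves like a typed spreadsheet: each column has its own
# class, and whole-column operations need no explicit loops.
df <- data.frame(LPAR    = c("QA", "QB", "QD"),
                 PHY_TOT = c(0.63, 1.32, 0.44))

# Derive a new column in one vectorized expression.
df$CPU_HOUR <- df$PHY_TOT * 0.13

df$PHY_TOT[2]   # reference a single item
df$PHY_TOT      # or the entire column as a vector
```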
As in any programming environment, there are a number of ways of getting similar results. In R there are a number of functions (apply, aggregate, tapply, etc.) that can summarize data in a pivot table-like format. R also has a number of “packages” (similar to modules in Perl, classes in Java, or libraries in C/C++) which encapsulate useful functions that minimize the amount of code that has to be written.
R has a number of these packages that make it easy to “transform” data, aggregate the data and then summarize the results. One of the packages that I have found very useful is the “reshape” package, which lets you restructure and aggregate your data with just two functions: 'melt' and 'cast'. 'melt' puts the data into a format that can be used by 'cast' to then create new aggregations of the data. Documentation is provided with the package that provides plenty of examples of how to use it.

Figure 4 - R Commands to Create the Pivot Table

Figure 3 - Pivot Table From Excel

DAY,HOUR,MIN,SEC,MACHINE,LPAR,PHY_TOT,MIPS,CPS,CPU_HOUR,TYPE
6/2/2008,0,30,1,713,*PHYSI,3.26,176.5907684,13,0.4238,PROD
6/2/2008,0,30,1,713,AAMTBC,0,0,13,0,TEST
6/2/2008,0,30,1,713,BBMTBC,0,0,13,0,TEST
6/2/2008,0,30,1,713,GGMTBC,0,0,13,0,TEST
6/2/2008,0,30,1,713,QA,0.63,34.12643684,13,0.0819,TEST
6/2/2008,0,30,1,713,QB,1.32,71.50301053,13,0.1716,TEST
6/2/2008,0,30,1,713,QD,0.44,23.83433684,13,0.0572,TEST
6/2/2008,0,30,1,713,SOLAR1,0.33,17.87575263,13,0.0429,PROD

Figure 2 - CSV File for Input to R
So in the script, I indicate that I want to use the package [require(reshape)], and then I 'melt' the dataframe that was read in, specifying that I intend to use three of the columns (DAY, HOUR, MIN) to aggregate the data and that the value I want to aggregate is CPU_HOUR.
Now that the data has been 'melt'ed, it can be 'cast' into some output. The cast function has as its first parameter the object (cpu.melt) from the 'melt', and then a formula specifying how the data is to be aggregated. The formula 'DAY + HOUR ~ MIN' indicates that the rows will have DAY and HOUR, and that the columns will contain the MIN. The data will be aggregated with these variables, and the 'sum' will be computed and stored in the resulting dataframe. There is also a parameter to indicate that “margins” are to be created. Margins will produce row totals and column totals on the control breaks, which in this case is DAY. The output of the first 25 lines is shown in Figure 4.
Comparing this output with Figure 3 shows the results are the same; the layout of the data is different. The last command just creates a pivot table summarizing the CPU_HOUR per day. The data file had over 10,000 lines of data. It took 1 second to read the data in and create the two “pivot” table outputs. The script can be reused to read in any number of data files.
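A hedged sketch of the pivot just described: the commented 'melt'/'cast' calls are my reconstruction of the paper's script (the exact arguments are not shown in the text), and the runnable portion uses base R's xtabs(), one of the built-in alternatives mentioned above, with a small invented dataframe.

```r
# Reconstruction of the reshape-package calls described in the text:
#   require(reshape)
#   cpu.melt <- melt(cpu.15, id = c("DAY", "HOUR", "MIN"),
#                    measure = "CPU_HOUR")
#   cast(cpu.melt, DAY + HOUR ~ MIN, sum, margins = TRUE)

# An equivalent cross-tabulation in base R, on invented sample values:
cpu.15 <- data.frame(DAY      = "6/2/2008",
                     HOUR     = c(0, 0, 1, 1),
                     MIN      = c(15, 30, 15, 30),
                     CPU_HOUR = c(0.42, 0.08, 0.17, 0.06))
cpu.15$DH <- paste(cpu.15$DAY, cpu.15$HOUR)   # DAY+HOUR row key

# Rows are DAY+HOUR, columns are MIN, cells are summed CPU_HOUR.
pivot <- xtabs(CPU_HOUR ~ DH + MIN, data = cpu.15)
addmargins(pivot)    # row/column totals, like the pivot-table margins
```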
2.2 Pivot Charts
Another use of the output from a pivot table is to generate a chart. John had a data file about batch jobs being run. A sample of the contents of the Excel spreadsheet is shown in Figure 6. This data was summarized by shift, and the pie chart in Figure 7 was created; the pivot table for this chart is Figure 8.
Figure 9 shows the R script used to read in the CSV file created from the Excel spreadsheet, summarize the CPU hours by shift and then create the pie chart in Figure 11. This used another R function (tapply) to create the aggregation by shift. As I mentioned previously, there are a number of ways of doing things in R. I did notice one difference in the data in that John's pivot table filtered out HOLIDAY since there was such a small usage. I chose to leave it in, but could have easily removed it from the data. This file had about 24,000 data lines. It took 0.5 seconds to read in the data, aggregate the data and generate the pie chart.
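The shift summary can be sketched as follows; the shift names match the chart, but the CPU-hour values are invented stand-ins for the 24,000-line file.

```r
# tapply() groups the CPU_HOUR values by SHIFT and sums each group;
# pie() then draws the breakdown. Values are illustrative only.
batch <- data.frame(
  SHIFT    = c("PRIME", "PRIME", "PERIOD2", "PERIOD3", "WEEKEND", "HOLIDAY"),
  CPU_HOUR = c(10.2, 8.1, 4.4, 3.0, 2.2, 0.1))

by.shift <- tapply(batch$CPU_HOUR, batch$SHIFT, sum)
pie(by.shift, main = "Breakdown by Shifts")
```

Dropping HOLIDAY, as John's pivot table did, would be one line: `batch <- subset(batch, SHIFT != "HOLIDAY")` before the tapply.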
Figure 5 - Dataframe in R Created from the CSV File (Looks Like Excel Spreadsheet)

Figure 6 - Batch Data From Excel

Figure 7 - Pie Chart of Shift Usage

Figure 8 - Pivot Table of Shift Usage

The final example makes use of some implied information in the data. In the spreadsheet, the column DB2 had a name such that if the 3rd character was a “P”, then it was production (PROD); otherwise it was development (DEV). So when the data was read in, a new column was added with this indication so the pivot table could be generated.
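A hedged reconstruction of that derived column (the DB2 names below are invented; only the "3rd character is P" rule comes from the text):

```r
# If the 3rd character of the DB2 name is "P" the row is production,
# otherwise development. substr() and ifelse() are both vectorized,
# so one expression classifies every row.
jobs <- data.frame(DB2 = c("DBP01", "DBD02", "DBPXX"))
jobs$WORKLOAD <- ifelse(substr(jobs$DB2, 3, 3) == "P", "PROD", "DEV")
jobs$WORKLOAD
```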
Figure 10, Figure 12 and Figure 13 are the data in the Excel spreadsheet and the pivot table and chart created from the data. Figure 14 is the R script to read in the data, create a new column with the workload, create the pivot table and then generate the chart in Figure 15. This data only had 96 rows, and it took 0.2 seconds to read in the data, do the transformations, generate the pivot table and the chart.
3 Sparklines
In Ron Kaminski's paper on “Automating Process Pathology Detection – Rule Engine Design Hints” he described “sparklines” as one of the ways of presenting a lot of data in a small amount of space. Basically, sparklines are graphs without the axes to clutter up the presentation of information. Sparklines were invented by Edward Tufte, who is a well known expert on data visualization.
Figure 16 is an example of sparklines showing the price of 4 stocks over a 5 year period. You can see that they have roughly the same shape, even though the y-axis has different ranges. Numbers provide the extent of these ranges and identify other important points.
I have used multiple graphs on a page to show the relationships between various measurements, but typically I was limited to displaying around 15 charts with all the extra space being taken up
Figure 9 - R Script for Shift Usage
Figure 11 - Pie Chart from R (Breakdown by Shifts: HOLIDAY, PERIOD2, PERIOD3, PRIME, WEEKEND)
Figure 10 - Excel Data for Prod/Dev Pivot Table
Figure 12 - Excel Pivot Table from Data
Figure 13 - Chart from Excel Pivot Table
[Chart axes: cpu seconds, 0 – 2,500,000; months 5/1/2007 – 6/1/2008; series DEV and PROD]
by the labeling of the axes.
Figure 25 is just to show the amount of space that is taken up with labeling the axes and such. It also makes it hard to compare different graphs to look for patterns.
With R it is easy to generate sparklines because you have complete control over how graphics are created. R has some very sophisticated graphics, but I will use just the basic graphics to show how sparklines can be created.
The only difference between creating a set of charts like Figure 25 and sparklines is telling the system not to create the axes and to plot the data in a smaller window. The charts in Figure 25 were created from running the 'vmstat' command on a UNIX system. Vmstat will record about 20 different measurements including CPU utilization, memory and number of running processes. Similar data will be used to demonstrate sparklines.
One of the scripts that I have running on systems that I monitor writes the vmstat data to a file with a timestamp. This data is then read by the analysis programs, and reports and charts are created. An example of the log file is shown in Figure 17.
This data is read in and results in a matrix with each row being a sample and the columns the data for that sample. Figure 18 shows the amount of R code that was written to create a plot of sparklines with 'nr' rows and 'nc' columns on a single page. Figure 26 is the sparklines that were generated.
This represents one day of system operation (00:00 – 24:00). On the left side of each sparkline is the name of the measurement being plotted. This is followed by its average value over the day. The average value is represented by the horizontal gray line that can be used as a reference as to the variation of the data. The red number on the left above the gray line is the maximum value; the green number on the right below the gray line is the minimum value for the day. This allows you to quickly see some of the relationships. There is also a red dot to mark the first maximum and a green dot to mark the first minimum of the sample.
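The approach can be sketched as below. This is my own minimal reconstruction, not the paper's Figure 18 script: small panels, no axes, a gray mean line, and red/green dots at the first maximum and minimum; the annotation numbers and sample data are simplified inventions.

```r
# Plot each column of matrix m as a sparkline in an nr-by-nc grid:
# no axes, a gray reference line at the mean, and dots at the first
# maximum (red) and first minimum (green).
sparklines <- function(m, nr, nc) {
  op <- par(mfrow = c(nr, nc), mar = c(0.5, 0.5, 0.5, 0.5))
  on.exit(par(op))
  for (j in seq_len(ncol(m))) {
    x <- m[, j]
    plot(x, type = "l", axes = FALSE, xlab = "", ylab = "")
    abline(h = mean(x), col = "gray")                 # reference line
    points(which.max(x), max(x), col = "red",   pch = 16)
    points(which.min(x), min(x), col = "green", pch = 16)
  }
}

# Invented demo: 96 samples where "idle" mirrors "busy", as with the
# idle and user+system measurements discussed below.
m <- cbind(busy = 40 + sin(1:96 / 8) * 20,
           idle = 60 - sin(1:96 / 8) * 20)
sparklines(m, nr = 2, nc = 1)
```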
The easiest one to point out is the last two lines on the chart: the idle time and the 'user + system' time. As you can see, these are mirror images of each other, and this is what you would expect from the data.
Even without the time being explicit, since we know that this represents a 24-hour day, we can see that the first third of the day appears to be the busiest, with the overall activity in the rest of the day being low. For this system, that is what happens; it processes the performance data from a number of systems by downloading log files and then processing the data so that it is
Figure 14 - R Script to Create Pivot Table and Chart
Figure 15 - Chart Generated from R
[Chart axes: Total CPU Seconds, 0 – 2,000,000; months 2007-05-01 – 2008-06-01; series PROD and DEV]
Figure 16 - Example of Sparklines
Figure 17 - Example of vmstat Log File
ready by 07:00 for review to see how the system performed the previous day.
Figure 27 was from a CMG2004 paper I wrote and is a “levelplot” of the system utilization for a month. It uses color to show what would be the z-axis value (utilization) if this were a 3D graph. The data used to create the sparklines is from 5/16/05, so you should be able to compare the utilization (user + sys) of the sparkline with the levelplot.
I also added to the plot the set of sparklines for the same period. Do they both convey the same information to you?
In Figure 28 I just took the month's worth of sparklines and replicated them 12 times to show what a year's worth of utilization might look like. Wouldn't it be nice to have a page like this for each of your systems so that you could look for patterns? You could also line up the plot so that a day of the week was a row, so you could see the pattern for that day in the month across the year.
If you really like 3D plots, R can generate those also. The 'rgl' package will create a 3D plot that you can rotate with a mouse to see different views. Figure 29, Figure 30 and Figure 31 show the interactive 3D graphs that can be created with R.
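The rgl package gives interactive, mouse-rotatable views; as a hedged, static base-R analogue, persp() renders one fixed viewpoint of a utilization surface. The surface below (days by hours) is invented, not the paper's data.

```r
# Invented day-by-hour utilization surface: a daily cycle that peaks
# mid-day, just to give persp() something shaped like real data.
util <- outer(1:31, 0:23,
              function(d, h) 40 + 30 * sin(h / 24 * 2 * pi))

# One fixed 3D view; theta/phi pick the viewing angle that rgl would
# instead let you choose interactively with the mouse.
persp(x = 1:31, y = 0:23, z = util,
      xlab = "Day", ylab = "Hour", zlab = "Utilization",
      theta = 30, phi = 25, col = "lightblue")
```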
4 Transaction Data
I want to use some transaction data to show another way of visualizing the data from a pivot table. I originally had a transaction log of 79,000 transactions; 159 transaction types across 300 users. To make the data easier to present, I created 10 transaction types by splitting the transactions based on their response times (Trans.01 has the shortest average response time and
Figure 18 - R Function to Plot Each Column as a Sparkline With 'nr' Rows and 'nc' Columns
Figure 19 - Transaction Count for User/Tran
Figure 20 - Stacked Bar Chart of the Transaction Count
[Stacked bar chart: Transaction Count by User; y-axis Total Transactions, 0 – 15,000; x-axis User.01 – User.10; stacked series Tran.01 – Tran.10]
Trans.10 has the longest). The users were just split into 10 groups randomly.
The log file has the user, transaction, start and end times. The file was read in with an R script, and the pivot table in Figure 19 was created. If you look at the data, User.06 has the smallest transaction count and User.08 the largest.
One way of visualizing this information is using a stacked bar chart as shown in Figure 20. Here it is easy to see that User.08 entered the most transactions and User.06 the least. But it is hard to determine, for each user, the ratios between the individual transactions for that user. This is where a “mosaic” plot helps to visualize this relationship. In a mosaic plot, the values are plotted as rectangles and the area of the rectangle is proportional to the count. The vertical axis will be the same for all variables so that you can see the relationships of the transaction counts for a user.
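A hedged sketch of such a mosaic plot: a user-by-transaction count table is built with xtabs(), and base R's mosaicplot() sizes each rectangle by count. The nine log entries are invented.

```r
# Invented transaction log: User.01 entered 6 transactions, User.02
# entered 3, spread unevenly across three transaction types.
txn <- data.frame(
  USER = rep(c("User.01", "User.02"), times = c(6, 3)),
  TRAN = c("Tran.01", "Tran.01", "Tran.02", "Tran.02", "Tran.02", "Tran.03",
           "Tran.01", "Tran.03", "Tran.03"))

# Cross-tabulate counts, then draw rectangles with area ~ count.
counts <- xtabs(~ USER + TRAN, data = txn)
mosaicplot(counts, main = "Transactions by User", color = TRUE)
```

In the resulting plot the User.01 column is twice as wide as User.02, and within each column the rectangles split by that user's transaction mix, which is exactly the per-user ratio comparison the paragraph above describes.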
Figure 21 is the mosaic plot of the pivot table data. You can see on the chart that User.08 has the widest vertical area, indicating that this user has the highest total transaction count; User.06 has the least area, indicating the lowest transaction count.
In this view of the data, you can see that User.06 has a higher percentage of transactions Tran.06, Tran.09 and Tran.10 than User.08. This might indicate that these two users have different roles and therefore execute different transaction mixes. A mosaic plot can help identify this condition.
Figure 24 shows the relationship of the ratios of the average response times of transactions for each user. Here you can see that Trans.10 appears to have an average response time that is almost equal to the sum of the response times for the other 9 transactions. Again, based on how I partitioned the transactions, Trans.10 should have the longest response time, but even across some of the users, there is quite a bit of variation. Remember that this chart does not show the value of the average response time of a transaction for a user, just the ratio of its response time compared to the other transactions executed by that user. Figure 22 shows the average transaction response time for each user.
Even though Trans.10 has the longest response time, it is executed less frequently than most of the other transactions, as you can see in Figure 21.
Figure 23 is a graph of the sparklines of the distribution of the average response times of the transactions for a given user. Think of this as a histogram drawn with a smooth line. The x-axis is 0-3 seconds for the response times. In the data, there was a maximum of 879 seconds for one transaction (I am not sure if the user really waited for a response in this case); the 95th percentile was 1.7 seconds, so I chose 3 seconds for the chart since this encompassed over 95% of all the transactions. In most cases, SLAs (service level agreements) are based on XX% of the response times being less than a given number. Systems I have worked on in the past had this number as 90%/95%.
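One such density sparkline can be sketched as follows; density() gives the "histogram drawn with a smooth line", and quantile() gives the 95th percentile used above to pick the 0-3 second axis range. The response times are simulated, not the paper's log.

```r
# Simulated response times standing in for one user's transactions.
rt <- rexp(5000, rate = 2)

# The 95th percentile guides the choice of x-axis range, as in the text.
p95 <- quantile(rt, 0.95)

# Smooth density over 0-3 seconds, drawn sparkline-style with no axes.
d <- density(rt, from = 0, to = 3)
plot(d, axes = FALSE, main = "", xlab = "", ylab = "")
```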
Looking at the data, it appears that User.06 and User.07 have larger tails on the right side, indicating that they are experiencing longer average response times. The pivot table in Figure 22 shows that these
Figure 21 - Mosaic Plot of Transaction Counts for a User
[Mosaic Plot of the Number of Transactions by User - Area Proportional to Count; x-axis User.01 – User.10, y-axis Trans.01 – Trans.10]
Figure 22 - Average Response Time of Transactions for Each User
users do have the longest average response times across all their transactions. These users might have different roles, and therefore execute a different mix of transactions, some of which have longer response times. It is this type of analysis that leads to a better understanding of your environment.
5 Wrap-Up
Hopefully I have given you some examples of other things that R can do, and hopefully they will whet your appetite to learn more about R.

R should be considered as one of the tools that you have in your toolkit. In my current engagement, I use R for most of the analysis that I do, but still make extensive use of Excel. Excel happens to be the preferred way of interchanging data among the other people on the projects. They will give me data in an Excel spreadsheet that I can use as input. When I generate output, in many cases I will transfer the results to an Excel spreadsheet (R can write Excel workbooks with multiple sheets) since it allows the recipient to do further manipulations of the data, or to include the data in Word documents or PowerPoint presentations.
The R scripts, and data, used in this paper are available if you send me an email requesting them.
6 References
[1] J. Van Wagenen, “Pivot Tables/Charts – Magic Beans Without Living in a Fairy Tale”, CMG 2008
[2] R. Kaminski, “Automating Process Pathology Detection – Rule Engine Design Hints”, CMG 2008
[3] R Development Core Team, “R: A Language and Environment for Statistical Computing”, ISBN 3-900051-07-0, http://www.R-project.org
[4] J. Holtman, “Using R for System Performance Analysis”, CMG 2004
[5] J. Holtman, “Visualization Techniques for Analyzing Patterns in System Performance Data”, CMG 2005
[6] N. J. Gunther, “Guerrilla Capacity Planning”, Springer-Verlag, Heidelberg, Germany, 2007
[7] H. Wickham, “Reshaping Data with the reshape Package”, Journal of Statistical Software, 21(12), 2007
[8] W. N. Venables and B. D. Ripley, “Modern Applied Statistics with S”, Fourth Edition, Springer, 2002, ISBN 0-387-95458-0
[9] E. Tufte, “Beautiful Evidence”, Graphics Press, 2006
[10] P. Spector, “Data Manipulation with R”, Springer (Use R! series), 2009, ISBN 978-0387747309
Figure 24 – Ratios of Average Response Time of Transactions for a User
[Mosaic Plot of Response Times - Area Proportional to Time; x-axis User.01 – User.10, y-axis Trans.01 – Trans.10]
Figure 23 - Sparklines of the Density (Histogram) Plot of Response Times for a User
Response Time Distribution -- Sparklines
[Sparkline panels: User.01 – User.10]
Figure 25 - Typical Multiplots Per Page - Data from 5/16/05
Figure 26 - Sparklines Created from 'vmstat' Log File: 19 Different Measurements for 5/16/05 (red is max; green is min)
Figure 27 - Levelplot (3D on 2D Surface) of System Utilization for a Month + Equivalent Sparklines
Figure 28 - What One Year of System Utilization Might Look Like in Sparklines
Figure 29 - 3D Chart of the Utilization Data
Figure 30 - Another View of the Same Data
Figure 31 - Yet Another View from Underneath