Biological Data Analysis Using R

Rodney J. Dyer, PhD

Department of Biology
Center for the Study of Biological Complexity

Virginia Commonwealth University


Contents

Preface

I Basic Usability

1 Getting R
    1.1 What Is R
    1.2 Where Do I Get It?

2 Language & Grammar
    2.1 Overview
    2.2 Function Quickie
    2.3 Variables
    2.4 Data Types
    2.5 Operators
    2.6 Useful Functions
    2.7 Exercises

II Biologically Motivated Topics

3 Data Frames
    3.1 Data Input/Output
    3.2 Slicing
    3.3 Complex Selections
    3.4 Useful Functions
    3.5 Exercises

4 Summary Statistics
    4.1 Distributions
    4.2 Random Number Generation
    4.3 Descriptive Statistics
    4.4 Relationships Between Pairs of Variables
    4.5 Useful Functions
    4.6 Exercises

5 Contingency Tables
    5.1 One Random Sample
    5.2 Paired Observations
    5.3 Several Random Samples
    5.4 The Formula Notation & Box Plots
    5.5 Useful Functions
    5.6 Exercises

6 Linear Models
    6.1 The t-test
    6.2 Regression With A Single Variable
    6.3 Multiple Regression
    6.4 Analysis of Variance
    6.5 Useful Functions
    6.6 Exercises

7 Working With Images
    7.1 Image Data
    7.2 Loading The Image Into R
    7.3 Components of A Pixmap
    7.4 Image Operations
    7.5 Creating Images Programmatically
    7.6 Useful Functions
    7.7 Exercises

8 Matrix Analysis
    8.1 Matrices In R
    8.2 Stage-Classified Matrix Models
    8.3 Useful Functions
    8.4 Exercises

9 Working With Strings
    9.1 Parsing Text Data
    9.2 Producing Formatted Output
    9.3 Plotting Special Characters
    9.4 Useful Functions
    9.5 Exercises

III Extending R

10 Basic Scripts
    10.1 Writing Scripts
    10.2 Evaluating Scripts
    10.3 Adding Comments To Your Code
    10.4 Useful Functions
    10.5 Exercises

11 Programming
    11.1 Looping
    11.2 Conditional Statements
    11.3 Outlining A Program
    11.4 Creating A Program
    11.5 Synopsis
    11.6 Useful Functions
    11.7 Exercises

12 Functions
    12.1 Function Syntax
    12.2 Scope
    12.3 Useful Functions
    12.4 Exercises

A Answers to Exercises

B Installing Additional Libraries
    B.1 Library Availability
    B.2 Installing Libraries

Bibliography

Index

List of Tables

2.1 Common constants you will run across in R

4.1 Some useful additional commands to customize the appearance of a figure. For a complete listing of possible values that can be customized, try the ?par command.

4.2 Graphics devices for output of figures

5.1 Diversity of enrolled undergraduate students at Virginia Commonwealth University in the College of Humanities & Sciences between the academic years 1998-2008 as reported by the Center for Institutional Effectiveness (http://www.vcu.edu/cie/analysis/reports/sets.html).

8.1 Table of life history values separated into A Fertility estimates (the fX items) and B transition probabilities depicting the movement between stages and within stages.

9.1 Caption For Table

List of Figures

1 Example scatter plot.

4.1 Values for the density function for the χ² distribution with 1, 2, and 3 degrees of freedom.

4.2 A graphical depiction of the critical value of the χ² distribution for α = 0.05 and df = 3. The shaded region constitutes a proportion of the area under the curve equal to α.

4.3 Some example graphs with alternate values for symbols, line types, widths, colors, and titles.

4.4 Plot of two data sets using the par(new=T) command but not taking into consideration the axis limits of the two data sets before plotting.

4.5 Plot of two variables on the same axis after correcting for the range of each data set.

4.6 Image of colored Poisson distribution that was copied from the graphics device to a jpeg file.

4.7 Examples of the densities of two normal distributions; the red one is drawn from a random normal distribution with default values of µ = 0 and σ = 1 and another in blue that has µ = σ = 5.

4.8 Histogram with labels and main title changed.

4.9 Histogram of 1000 random numbers drawn from a Poisson distribution with the λ parameter set to 5. The red line indicates the density of the values.

4.10 Example locations for first two moments of a Normal (N(0, 1)) distribution.

4.11 Negative (left) and positive (right) distributions. In both of these examples the dotted line connects the mode of the distribution (the top peak) to the mean (on the x axis). The direction of this lean determines if the distribution has a negative (left) or positive (right) skew.

4.12 Three distributions (exponential, normal, and logistic) showing different levels of kurtosis.

4.13 Matrix of four plots created from random numbers sampled from the normal, Poisson, exponential, and logistic distributions.

4.14 Distribution of random numbers drawn from rpois(1000,5).

4.15 Scatter plot of some semi-random points.

4.16 Example plot of two variables used to test correlations.

5.1 Undergraduate diversity at Virginia Commonwealth University during academic years 1998, 2003, & 2008.

5.2 Boxplot of Pinus echinata germination data partitioned by timber extraction treatment.

6.1 Plot of single variable regression values.

6.2 Regression model added to plot of points using abline function.

6.3 Regression model with fitted line and formula.

6.4 A 2x2 matrix plot of some diagnostic tools associated with a linear model. They include a plot of the residuals (e_ij) as a function of the fitted values (y_i) to see if there are systematic biases in the model (upper left), a Q-Q plot to examine normality of the residuals (upper right), a scale location plot (lower left), and a leverage plot to look for outliers (lower right).

6.5 Boxplot of germination percentages for Pinus echinata as a function of treatment. A colored rug was added to the right side to show the actual values within treatments (see rug).

6.6 Confidence intervals for difference in mean germination rates for Pinus echinata families.

7.1 The image represented in the r.pbm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org).

7.2 A PBM file that was programmatically created in R. The image is rotated because of the default location of the origin.

7.3 The image represented by the dog.pgm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org).

7.4 The image represented in the Libbie.ppm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org).

7.5 The original image along with versions in which only the red, green, or blue channel is turned on.

7.6 The greyscale translation of the PPM image, a histogram of the grey values, and the image resulting from reducing all the grey values in the image by half.

7.7 A random image.

7.8 A random image with a square doughnut hole in the middle.

8.1 Image depicting two vectors v_red = [4, 2] and v_blue = [2, 1] that are projecting in the same direction but have different magnitudes.

8.2 A graphical depiction of the life history stages in the fictitious plant Grenus growii.

8.3 Effects of the instantaneous growth rate λ as a function of time for both exponential growth (λ_blue = 1.2) and exponential decay (λ_red = 0.8).

8.4 Examples of two different calls to the plotting function barplot(). The parameters used to create these plots are given in the R code.

8.5 Example of a stacked bar plot with multiple categories represented in each Treatment.

8.6 Size of the four stage classes through time.

8.7 Differences in estimated proportions of individuals in each stage from what was expected through time.

9.1 Histogram of distance estimates among all sequences using the "K90" model of substitutions.

9.2 Neighbor joining tree based upon the trnL-trnF intergenic spacer sequences and the "K90" model of sequence evolution.

9.3 The html printout of an xtable as interpreted in Firefox. You can also import tables saved as html into popular word processors and use them as normal table items in the creation of your documents.

9.4 Example of using the expression function to annotate a graphic.

11.1 Hemispherical photograph of winter roosting habitat at Monarch Biosphere Reserve, Mexico. Photo by S.B. Weiss made available by the Creative Commons Attribution 2.5.

11.2 The blue channel of the canopy picture displayed as a greyscale image.

11.3 A histogram of values in the blue channel (Figure 11.2).

11.4 Intensity of blue channel values in the image as taken through a slice of the image (at pixel row 230 as indicated by red dashed line).

B.1 Example of CRAN mirror window as viewed on Linux.

B.2 All packages that can be installed from the selected mirror server on my machine.

Preface

This manuscript was written to scratch a particular itch that I felt was not being satiated. Increasingly, students in biological research programs, at both the undergraduate and the graduate level, are dealing with data sets that are both enormous in size and varied in representation. Image data, sequence data, counts of species in communities, nutrient flux, reaction networks, and a whole host of other kinds of data are encountered on a daily basis in the biological sciences. In order to "drink from this firehose" of data, it is important that we have the correct kinds of tools; the spreadsheet metaphor is no longer valid.

After spending a few years encouraging students to learn a tool, any tool, that would help them deal with the complexity of data we encounter, I decided to put together a course focusing on how R can be used to deal with many different kinds of data. This course was designed for incoming graduate students in Biology at Virginia Commonwealth University with the goal of getting them familiar with R from the beginning of their graduate work. Many of the graduate faculty in Biology use R in their courses and find that a non-trivial amount of time needs to be spent introducing students to R in each course, which takes away from the focus of the course. However, if a student had taken a short course in R when they began their graduate work, then it would be possible to spend more time in our individual courses focusing on the topic at hand.

This manuscript is not designed to be one of the "Biological discipline X in R" kind of offerings; there are already a lot of those kinds of books available. My goal here is to introduce the reader to a wide variety of data types that we deal with in Biology and give a brief introduction to how R can be used to interact with, and perhaps perform analyses on, these data. The treatment of any one kind of data is relatively shallow, as I am assuming that students are going to take a specific course on that topic in the future. And when they do, they will have already seen how R will make their life easier. In my own research, I use tools such as R in many different circumstances and feel that students can only benefit from a broad understanding of how R can assist in their research. With this focus, it is no coincidence that the kinds of data introduced in this text are pulled directly from the graduate courses that our students will take, such as Community Ecology, Population Genetics, Population Ecology, Evolution & Speciation, Biological Complexity, Molecular Genetics, Landscape Genetics, Bioinformatic Technologies, Ecological Genetics, and Quantitative Ecology.

Given the range of topics covered herein, I think this manuscript has a broad audience, as I assume that the reader of this text will not have much previous experience using R. Obviously, incoming graduate students are my primary audience. However, I also feel that this would be a good beginning text for one who is already working in the field and would like to gain a broader introduction to how R can be used in their particular discipline.

Contents

This manuscript has been partitioned into four separate sections. The first section introduces R as a language and a tool and covers some basic topics that are required to get one going. The next section contains eleven chapters that each target a particular aspect of biological inquiry from the perspective of the kind of data that will be analyzed. The third section focuses on how you can extend the R environment by developing scripts and defining your own functions and libraries. The final section of this text is an appendix that includes the answers to odd-numbered questions from the exercises in each chapter as well as some additional information on installing additional libraries or groups of libraries.

There are some common elements to each chapter that make it easy for the reader to get the larger picture of the topics being introduced. At the beginning of each chapter, a specific list of topics and skills that are to be covered is provided. As topics are introduced, the R code is provided and keywords from the R programming language are highlighted to help the reader follow along.

At the end of each chapter, all the R functions that were used in the chapter, along with a brief definition of the arguments passed to each function, are provided as a quick reference source. Each chapter also contains a set of exercises that can test the reader's understanding of chapter topics. Answers to odd-numbered exercise problems are provided in Appendix A. Throughout the text, all of the R functions used are also indexed so that the reader can easily find instances where they were used.

Part 1: Basic Usability

The first part of this manuscript contains the basic information that is required to install and begin using R for data analysis. This section has the following chapters:

Chapter 1: Getting R. This chapter provides information on how to download the latest binary release for R as well as how to compile it from source code. Particular attention is paid to the differences associated with installing R on different platforms.

Chapter 2: Language & Grammar. This chapter begins introducing the R programming language by focusing on the different kinds of data types that are used (e.g., integers, decimal values, factors). Topics covered include a basic overview of what a function is, an introduction to the most commonly used data types in R, and general operations on these data types.


Part 2: Biologically Motivated Topics

The second section of this manuscript contains the main content.

Chapter 3: Data Frames. The data frame is a fundamental object in R. This chapter builds upon the basic understanding of data frames (introduced in Chapter 2) by introducing several methods for putting your data into new and existing data frames, as well as persistent storage of data frames. This chapter also introduces the concept of using the data frame data type as a light-weight database object. This includes an introduction to making slices of a data set, the methods required to make complex selections of subsets of data, and joining data from multiple data frames.

Chapter 4: Summary Statistics. This chapter introduces the reader to general summary statistics for continuous data, statistical distributions, and random number generation. This chapter also provides the reader a first introduction to creating publication-quality graphics in R. General graphics include scatter and line plots, histograms, density plots, plotting several graphical objects on the same set of axes, creating matrices of plots, and saving graphics to file.

Chapter 5: Categorical Data. This chapter focuses on the analysis of categorical data and contingency tables. Given the ubiquity of the χ² test in Biology, a general treatment of contingency tables is provided with examples demonstrating how to examine genetic linkage disequilibrium, Hardy-Weinberg equilibrium, and demographic analysis testing for equality of population diversity. Both parametric and non-parametric approaches are introduced with examples.

Chapter 6: Linear Models. This chapter introduces the concept of linear models from simple correlations through single and multiple regression and ANOVA (which is introduced as regression with categorical predictors). Data for this chapter is derived from my own thesis working with the consequences of landscape modification on reproductive success in canopy trees. Examples of model diagnostics, model selection, and post-hoc tests are also covered.

Chapter 7: Working With String Data. This chapter uses genetic sequence data as an example of string-related data that can be manipulated in R. Basic skills in string searching and replacement are augmented with a short discussion of genetic sequence alignments and the use of online genetic databases such as NCBI, and the creation of phylogenetic trees using different algorithms is demonstrated.

Chapter 8: Image Data. This chapter focuses on image creation, importation, analysis, and manipulation. After a basic overview of image formats and manipulations, hemispheric canopy photos are used as an analysis topic on which several analyses are performed.

Chapter 9: Matrix Analysis. Matrix analysis is a general tool used in a variety of biological disciplines. In this chapter, the topic of life history analysis and population projection is used as an example for matrix operations in R.

Chapter 10: Multivariate Data. Ordination techniques are a broad class of methodologies that seek to understand the structure of multivariate data. In this chapter, vegetation data is used as an example of how one conducts and interprets basic ordination.


Chapter 11: Classification. This chapter focuses on how morphological shape analysis can be used for classification purposes. Morphological data from the bark beetle species complex, Araptus attenuatus, is used as an example for comparison with genetic classification schemes.

Chapter 12: Spatial Data. In this chapter, the analysis of spatial data is introduced. Topics covered include conversion of GPS way points and GIS data files into R data formats, plotting georeferenced raster and vector maps, and basic spatial analysis.

Chapter 13: Genetic Data. This chapter focuses on how one can represent genetic data in R and perform basic analyses on genetic structure. Examples include the analysis of inbreeding, population structure, association mapping, and population assignment tests.

Part 3: Extending R

The chapters in this section only require a basic understanding of R and can be used at any time, as they are stand-alone. In fact, it is suggested that after you get familiar with R, you should look into these chapters because they contain valuable information that will make your life easier.

Chapter 14: Creating Basic Scripts. This chapter addresses how to create basic R scripts so that you can reuse your code and analyses as well as have persistence across your R sessions.

Chapter 15: Programming R. This chapter covers basic programming, flow control, and decision control statements.

Chapter 16: Functions. This chapter demonstrates how the user can create individual functions from their scripts so that calling complex analyses and operations can be simplified.

Appendices

The last part of this manuscript includes supplementary material in support of the contents.

Appendix A: Answers to Exercises. This appendix provides answers to the odd-numbered problems located at the end of each chapter.

Appendix B: Installing Additional Libraries. There is a broad range of libraries that the R community provides, and this appendix shows you how to find and install additional libraries to your local copy.

Typographic Conventions

The developers of R have worked very hard to make sure that you can interface with R on any platform without worrying about which operating system you are using. However, there are some times when things are slightly different on alternate platforms. When there are platform-specific issues to be dealt with, I will make a notation in the margins with the name of the operating system next to the text to indicate the specific issue.

The book is not going to show you how to interact with R using a GUI, because in my opinion GUIs are for babies. If you want to learn how to use R, you will have to learn how to interact with it from the command line and write scripts for R to analyze your data. If you want a point-and-click interface for a statistical analysis program, then perhaps you should check out SPSS (Statistical Package for Social Sciences) or similar offerings. It is my belief that you will learn more about programming and data analysis if you learn the R language. There are only so many options that GUI-based analyses can provide, but with R on the command line, you will have the most flexibility in the analysis of your data. Moreover, when you create scripts to perform your analysis, you will have a persistent record of how you analyzed the data instead of just some data and results. Increasingly, peer-reviewed journals are suggesting that your analysis scripts be included in your supplementary materials for general consumption.

Throughout this book, I will provide examples of code in a box format. You will be able to tell what is code that can be entered in R because it will be separated from the main text and set in an alternate font, slightly shaded, and with R keywords colored appropriately. For example, the commands:

> x <- seq(0, 100, by=2)
> y <- rnorm(51)
> plot(x, y, xlab="X Axis", ylab="Y Axis")

create a scatter plot for the variables x, a sequence of even numbers from 0 to 100, and y, which holds random numbers sampled from a normal distribution. The result is given in a new graphics window with a plot similar to what is shown in Figure 1. How plots are made and saved to a file for subsequent use is covered in depth throughout the book. I have decided to sprinkle instructions on how to create graphics into the text at locations that are appropriate for the content being discussed, rather than creating one or more chapters on graphics with made-up data presented out of context with how that particular graphical representation is appropriate.
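As a small check on the scatter plot example (a sketch, not part of the original listing): seq(0, 100, by=2) produces 51 values, which is presumably why y is drawn with rnorm(51), since plot() needs x and y to be the same length. The call to set.seed() and the use of length(x) are my additions for illustration; the seed value is arbitrary.

```r
# Sketch: build x and y so that their lengths are guaranteed to match.
set.seed(1)                  # arbitrary seed, added so the random draws repeat
x <- seq(0, 100, by = 2)     # even numbers 0, 2, ..., 100
y <- rnorm(length(x))        # one random normal value per element of x
length(x)                    # number of points that will be plotted
plot(x, y, xlab = "X Axis", ylab = "Y Axis")
```

Tying the number of random draws to length(x) means the two vectors stay in step if the sequence is later changed.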

All code provided in this text will have text highlighting showing R keywords in dark blue and strings in red (see Chapter 2 for more information on these commands). If you are using a good editor to write your scripts, you will see this kind of text highlighting in your own work. In these code listings the > character is the prompt given by R and is not typed. I provide it here because I want to differentiate between code you type and answers that are given by R, which will not have the > character in them, such as:

> 2 * 6
[1] 12
> rnorm(10)
[1] -1.08495736 -1.25010428 -0.76237538 -0.08486045 -1.62145675 -0.54872689
[7]  0.64345848  0.43850325  0.26551658 -0.41362136
> pi/2
[1] 1.570796

where the answers are given in the line immediately following what was entered. Along with the answer is an index for the answer or answers. For example, the second example gets 10 random numbers from a normal distribution but can only fit 6 on a line before it wraps around. The [7] tells you that the first number on the second line is the seventh in the sequence. When you operate on vectors or matrices, these indices are relatively important and allow you to find specific elements rapidly.

Figure 1: Example scatter plot.
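The bracketed index in the output can be tried directly. The following is a minimal sketch; the variable name z and the seed are my own choices for illustration, and where the output wraps depends on the width of your console.

```r
# Sketch: the bracketed prefix on each output line is the index of the
# first element printed on that line.
set.seed(1)            # arbitrary seed so the draws are repeatable
z <- rnorm(10)         # ten random values from a normal distribution
print(z)               # each printed line starts with an index such as [1]
z[7]                   # select the seventh element by its index
z[7:10]                # select elements seven through ten
```

The same square brackets that label the output are used to pull individual elements back out of a vector.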

Acknowledgments

There are several people I would like to acknowledge for their assistance in this work. This has been possible primarily due to the flexibility of my Department in allowing me to "experiment" on our graduate students. Next, I wish to thank Dr. James Vonesh, who has goaded me into putting this together and been my colleague in crime as we continue to push R as a general tool in our curricula. Members of my laboratory, Stephen Baker, Daniel Carr, Candace Dillion, Crystal Meadows, and Cathy Viverette, sat through the first iteration of the course and have provided insightful feedback on both the focus and the content. I would also like to thank the developers of R, LaTeX, GRASS GIS, Emacs, and Vim, who have provided a set of tools that facilitate good research.

Rodney J. DyerRichmondJune 2009

Biological Data Analysis Using R

Part I

Basic Usability

Chapter 1

Getting R

I am not going to spend much time on how you go about getting and installing R on your computer. If you are going to use a machine on campus, it should already have R installed. If not, VCU does not allow students to install programs on campus machines, so this chapter is somewhat irrelevant anyway. However, if you are using your own computer (which is always the best idea), the internet has a much more in-depth and complete explanation of how to get and install the R environment for your particular machine. Reproducing that here would be a waste of paper and of both our time, as it would probably be out of date before long.

1.1 What Is R

R is both a language and an interface for statistical analysis, programming, and graphics. R is modeled after the S language, which was originally created at AT&T, and in many cases scripts written for R can be run in S with little to no modification. R has become a standard interface for statistical analysis in the biological sciences due in part to its openness, its ability to be extended by users, and its vibrant user base.

The R environment is a command-line interface that offers easy manipulation of data, calculation of parameters related to those data, an easy-to-understand grammar that facilitates rapid program creation, and the ability to produce publication-quality graphics. Moreover, you can create R scripts that describe how you analyzed your data so that in the future you can pick up where you left off. Increasingly, entities such as NSF and prominent research journals are making R scripts a normal component of the supplementary materials that you upload along with your research results and final reports. It is my opinion that the sooner you start documenting your data and creating a history of how you perform analyses on these data, the better off you will be in the long run.

1.2 Where Do I Get It?

The main webpage for R is located at http://www.r-project.org/. Here you can find information on the latest version of R available for your platform. Moreover, you can find some nice screenshots, find out what is new in the R community, and find links to manuals, newsletters, wikis, and books on R. There is a lot of information in the online community and, in general, they are a friendly lot. Since R has been around for quite a while, most of your basic questions can be answered by a quick Google search of the mailing list archives. It is always a good idea to check these out prior to posting to a discussion board or email list so you do not get the old RTFM treatment...

1.2.1 Installation From Binaries

The CRAN site maintains pre-compiled binary distributions for Linux, Mac OSX and Windows. These binaries are the latest stable versions of the software and contain the basic libraries that you need to run R on your operating system. Depending upon your platform, the package will contain an installer that allows you to clickity-click your way through the process and have a base R installation on your machine.

Connected from the main R site is also the CRAN repository, where people make available extensions to R that you can download and use. There is a tremendous variety of solutions available, and it is always in your best interest to see if someone has already tackled the problem you are working on. There is no reason to reinvent the wheel; your time is too valuable.

1.2.2 Compiling

If you know what a compiler is and have one on your computer, then you are probably able to compile the latest version of R on your machine. If you fall into this category then you do not need me to tell you how to proceed; there is a lot of good documentation on this on the R website.

Chapter 2

Language & Grammar

R is a language that has its own grammar, and in this chapter you will be exposed to some of its basic concepts. In this and all subsequent chapters, it is important for you to remember that computers do exactly what you tell them to do, which is often not what you wanted them to do. So learning the grammar is an important step in understanding R.

In this chapter, you will focus on the following topics:

• Learn basic data types and how to create them in R

• Understand various operators and how they can be used.

• Understand variable naming and be able to create, manipulate, and destroy variables.

This is a pretty short list of things but it will take you a bit of time to get through it. The main goal here is to understand a small subset of the different kinds of data that can be produced in R and how we interact with them. Later, we will become more proficient with them and add new data types as we move forward.

2.1 Overview

R itself consists of an underlying engine that takes commands and provides feedback on these commands. From a technical perspective R is called a functional language, as each command you give the R engine is either an expression or an assignment:

Expression An expression is a statement that you give the R engine. R will evaluate the expression, give you the answer, and not keep any reference to it for future use. Some examples include:

> 2 + 6
[1] 8
> sqrt(5)
[1] 2.236068
> 3*(pi/2) - 1
[1] 3.712389

In each of these examples, R evaluates the expression and gives you an answer. When you use it like this, R is acting as a glorified calculator.

Assignment An assignment causes R to evaluate the expression and store the result in a variable. This is important because you can use the variable in the future. An example of an assignment is:

> x <- 2+6
> myCoolVariable <- sqrt(5)
> another_one_number23 <- 3*(pi/2) - 1
> x
[1] 8
> myCoolVariable
[1] 2.236068
> another_one_number23
[1] 3.712389

Notice here the use of the assignment operator <-. This is made with a "less than" character and a "minus" character. As before, R evaluates the expression on the right-hand side; here x, myCoolVariable, and another_one_number23 are the names of the variables to which the resulting values were assigned. Also notice that to retrieve the value of a variable, you just type its name on the command line and R will print the current value.
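As a minimal sketch of the difference between the two kinds of command (the variable names radius and area are made up for illustration), an expression is evaluated and then forgotten, while an assignment keeps the result around for later use:

```r
# An expression: R evaluates it, prints the answer, and keeps no reference to it.
3 * (pi / 2) - 1

# An assignment: the result is stored under a name and nothing is printed.
radius <- 2
area <- pi * radius^2

# Typing the name (or using it in another expression) retrieves the value.
area
area / 2
```

Because area was assigned, it can be reused as often as needed; the value of the first expression would have to be recomputed.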

2.2 Function Quickie

This chapter will introduce you to several conventions, the main one of which is the function. A function in R is a collection of statements bound together under a single name to make them easier to use. In the previous example, I used the function sqrt(x), which gives the square root of the argument being passed (or an error if there is one).

Some functions are easy to understand and others are relatively complicated. We will spend a whole chapter on functions later in the book (see Chapter 12) when you begin to write your own. However, in the interim, you need to know a few things about functions.

1. A function has two parts: (1) a unique name, and (2) the stuff (e.g., variables) passed to it within the parentheses. Not all functions need any additional variables. For example, the function ls() shows which variables R currently has in memory and does not require any parameters.

2. If you forget to put the parentheses on the function and only use its name, by default R will show you the code that is inside the function (unless it is a compiled function). This is because each function is also a variable. This is why you should not use function names for your variable names (see 2.3 for more on naming).

3. To find the definition of a function, the arguments passed to it, details of the implementation, and some examples, you can use the ? shortcut. To find the documentation for the sqrt() function, type ?sqrt and R will show it to you. If R cannot find the function, you may have to do a more thorough search using help.search("functionName"). This searches throughout the documentation system and even uses some cool fuzzy searching techniques. For more info on how to use help.search(), type ?help.search (recursive logic recurses...).

4. Functions can be organized into libraries and only loaded when needed. At the time of this writing there are just over 1600 different packages containing different libraries on http://cran.r-project.org. There is no reason to have every conceivable library loaded; in fact, loading them all would probably leave little memory for you to work with your data. As a rule of thumb, only load the libraries that you need when you need them. More on libraries as we go forward.

5. Functions may have more than one parameter passed to them. Often, if there are a lot of parameters, some of them will have default values. For example, the log() function provides logarithms. The definition of the log function is shown as log(x, base=exp(1)) (say, from ?log). Playing around with the function shows:

> log(2)
[1] 0.6931472
> log(2, base=2)
[1] 1
> log(2, base=10)
[1] 0.30103

where, without the optional base= parameter, it is clear that the log() function returns the natural log (in fact, if you type ?ln nothing is found).
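The points about parameters and default values can be tried out directly. The sketch below uses only log() and the base R function args(); the variable names byPosition, byName, and naturalLog are chosen just for this example:

```r
# args() shows a function's parameters and any default values.
args(log)

# Parameters can be matched by position or by name.
byPosition <- log(100, 10)          # base given by position
byName <- log(100, base = 10)       # base given by name
byPosition
byName

# Leaving out base falls back on the default, exp(1), i.e. the natural log.
naturalLog <- log(exp(1))
naturalLog
```

Naming the parameter (base = 10) is a little more typing but makes it obvious to a later reader which value is which.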

2.3 Variables

A variable is something that can hold an item for you. While this is a little bit of Dyer-speak, and I am sure that there are more elegant definitions, it is important to understand that variables are the things that you will interact with. For example, you may have a predictor and a response variable between which you want to find a correlation. It is your responsibility to define these variables so that you can subsequently use them in your analyses.

There are some naming conventions that you can follow to make your life a bit easier.

1. It is a pretty good idea for you to start your variable name with a letter. You cannot use a number or punctuation as the first character of a variable name (N.B. you can use a period to start it, but the variable will be hidden from you and you cannot see it with ls(), so unless you know what you are doing, don't do this).

2. Variable names cannot have spaces in them, although it is possible to use periods ("."), underscores ("_"), or what is called camel case (e.g., NumberOfDogsInHouse; notice the use of upper and lower case letters to smush words together and make them readable).

3. Try to name your variables something that makes sense to you. Using a, b, c, d, e, and f as variables is probably not as informative when you are reading the code as Rate, number_of_items, or foodDataForNovember.

4. In R, when you make a new variable such as x <- sqrt(2), that variable is in memory. You can recall it by typing its name and hitting return, you can use it later in functions or calculations, and you can manipulate it (e.g., x <- x/2 to decrease it by half).

5. The function ls() provides you a list of all variables that you have defined. It is a very helpful function. You can remove a variable from memory using the rm(variableName) function.
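A short sketch of the life cycle of a variable, from creation to removal, might look like the following. The variable names are arbitrary examples, and exists() is a base R function used here only to confirm that rm() did its job:

```r
# Create two variables (camel case and underscores both work).
numberOfDogs <- 3
food_per_dog <- 1.5

# Use them in a calculation and store the result.
totalFood <- numberOfDogs * food_per_dog
totalFood

# ls() lists what is currently defined in the workspace.
ls()

# rm() removes a variable; exists() confirms that it is gone.
rm(totalFood)
exists("totalFood")
```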

2.4 Data Types

R recognizes about a dozen different types of data. While it is important to know the differences between these data types, you will probably use only a fraction of them. All of the data types are characterized by what R calls classes. As such, every data type has three common functions associated with it: a constructor that creates a specified type, an introspection function that tells you if any variable is a particular type, and a casting function that allows you to coerce the contents of a variable into a specific type (a more complete discussion of functions can be found in ??). This may sound a bit confusing but in reality it is pretty straightforward. For example, the constructor type(x) will create a vector of x items of that type (where type is the name of the data type to create), is.type(x) will determine if x is that particular type of variable, and as.type(x) will return x translated into that type of variable. Confused yet? It really isn't that bad; the examples for each data type below will discuss the specifics.

To determine the type of any variable you can use the built-in function class(x). This will tell you what kind of variable x is and is relatively important in the discussions we are going to have below about coercion. This is an important concept for understanding data types. What follows is a brief discussion of each data type and, where appropriate, an example of the use of one, how to access it, and how we can operate on it.
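To make the constructor / introspection / casting pattern concrete, here is a minimal sketch using the character and numeric types (the names siteNames and counts are invented for the example):

```r
# Constructor: make a vector of three (empty) character strings.
siteNames <- character(3)
class(siteNames)

# Introspection: ask whether a variable is a particular type.
is.character(siteNames)
is.numeric(siteNames)

# Casting: coerce the contents of a variable into another type.
counts <- as.numeric(c("4", "8", "15"))
class(counts)
sum(counts)
```

The same three-function pattern repeats for each of the data types discussed in this section.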

2.4.1 Integers

An integer is a common counting number (i.e., one without a fractional part). Technically, integers can range from −∞ to ∞; however, in practice only a limited set of integers can be represented, on a range of roughly ±2 × 10^9. The integer type is typically used in the development of R libraries that need to pass succinct integers to C or FORTRAN code and is not typically used by the normal R end user.

Check out the code listing below and see how one can create, coerce, and use an integer.

> integer(5)
[1] 0 0 0 0 0
> x <- as.integer(5)
> x
[1] 5
> is.integer(x)
[1] TRUE
> class(x)
[1] "integer"
> x + 2
[1] 7
> class(x+2)
[1] "numeric"
> y <- integer(3) + 2
> is.integer(y)
[1] FALSE
> y <- integer(3) + as.integer(2)
> is.integer(y)
[1] TRUE

There are some things to notice about this:

1. The command integer(5) produces a vector (see 2.4.8) of five integers.

2. All of the items returned from the integer(5) function were assigned a value of zero (0), which is the default value for an integer until its value is changed to something else.

3. The variable x is assigned a particular integer, in this case 5, and this is verified by the class(x) statement.

4. You can perform operations on integers, but you need to make sure that you use other integers. For example, adding 2 to the vector of integers produced by integer(3) gives a "numeric" type, not an integer, whereas the integer(3) + as.integer(2) statement does return an "integer" type. This is your first example of coercion, where one data type is "magically" turned into another type. There are rules for these transformations, and the first one you should recognize is that the number 2 is not considered an integer. By default, numbers are treated as numeric values (see 2.4.2), as integers are not used that often.

5. When adding an integer as.integer(2) to a vector of integers, the same number is added to every element. There are a few more subtle things to know about adding things to vectors and I'll leave that until 2.4.8.

As I said above, the integer type is not used that often and is only provided here for completeness.
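Two related details that are not covered above but are part of base R: a number typed with an L suffix (e.g., 2L) is read as an integer literal, and .Machine$integer.max reports the largest integer that can be stored. A brief sketch, offered as a supplement rather than as part of the original listing:

```r
# The L suffix marks a literal as an integer rather than a numeric.
class(2)
class(2L)

# Integer plus integer stays an integer; mixing in a numeric does not.
y <- integer(3) + 2L
is.integer(y)
is.integer(y + 2)

# The largest integer R can store (roughly 2 * 10^9).
.Machine$integer.max
```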

2.4.2 Numeric

Numeric types represent the majority of number-valued items you will deal with. When you assign a number to a variable in R it will most likely be a numeric type (unless you specify otherwise, such as defined in 2.4.5 and 2.4.6). Numeric data types are displayed either with or without decimal places, depending on whether the value(s) include a decimal portion. For example:

> x <- numeric(4)
> x
[1] 0 0 0 0
> x[1] = 2.4
> x
[1] 2.4 0.0 0.0 0.0

Notice this is an all-or-nothing deal: once one element has a decimal portion, every element is displayed with decimal places. Also notice (especially those who have some experience programming in other languages) that dimensions in vectors (and matrices) start at 1 rather than 0.

Operations on numeric types proceed as you would expect, but since the numeric type is the default type, you don't really have to go around using the as.numeric(x) function. For example:

> is.numeric(2.4)
[1] TRUE
> as.numeric(2) + 0.4
[1] 2.4
> 2 + 0.4
[1] 2.4

shows that no matter how you do it, 2.4 is a numeric data type. In general, programmers are lazy people who try to do things that minimize the amount of typing they have to do (since they do a lot of typing to begin with) and as such the numeric type is the easiest to use.

2.4.3 Character

The character data type is the one that handles letters and letter-like representations of numbers. For example, observe the following:

> x <- "some sequence of letters of length 37"
> class(x)
[1] "character"
> y <- 23
> class(y)
[1] "numeric"
> z <- as.character(y)
> z
[1] "23"
> class(z)
[1] "character"

Notice how the variable y was initially designated as a numeric type, but if we use the as.character(y) function we can coerce it into a non-numeric representation of the number... There will be times when you need to translate various things into characters, such as when making titles and axis labels, and this will come in handy.

You need to think of the character type as a sequence of letters, numbers, symbols, or other stuff you can produce by pushing keys on your keyboard, enclosed in either single or double quotation marks. It doesn't really make much sense to perform any operations on a character type (e.g., what would you expect "hello"*3 to accomplish?) although you can paste() them together. For example,

> x <- "I am"
> y <- "not"
> z <- 'a looser'
> x
[1] "I am"
> y
[1] "not"
> z
[1] "a looser"
> paste(x, y, z)
[1] "I am not a looser"
> paste(x, z)
[1] "I am a looser"

It is important to note, if you are a real stickler for perfection, that the paste() function by default separates the individual variables you give it with a single space. However, this can be modified by telling the function what to use as the separator.

> paste(x, z, sep=" not ")
[1] "I am not a looser"
> paste(x, z, sep=", ")
[1] "I am, a looser"
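Since building titles and labels is the most common reason to paste strings together, a slightly fuller sketch of paste() may help. The strings are made up for illustration; sep= and collapse= are both documented parameters of paste():

```r
genus <- "Quercus"     # example strings only
species <- "alba"

# sep controls what goes between the pieces being pasted.
paste(genus, species)               # default separator is a single space
paste(genus, species, sep = "")     # no separator at all

# Numbers are coerced to character, and vectors are handled element by element.
siteLabels <- paste("Site", 1:3, sep = "_")
siteLabels

# collapse joins the elements of a vector into a single string.
joined <- paste(c("a", "b", "c"), collapse = "+")
joined
```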

2.4.4 Constants

Constants are variables that have a particular value associated with them that cannot be changed. They are mostly here for convenience so that we do not have to go look up values for common things. Below are listed some common constants that you will probably encounter as you play with R.

Table 2.1: Common constants you will run across in R

Constant   Description
pi         The mathematical constant π, representing the ratio of a circle's circumference to its diameter.
NULL       The absence of a type. This is the oubliette, complete nothingness, /dev/null, Richmond on a Wednesday night... This is commonly used by functions that return undefined responses.
NaN        Not a number.
Inf        Infinity (∞), as well as -Inf for −∞.
NA         Typically used to represent something that is not there or missing. You can use it for missing data if you like.

For the non-numerical constants, there are functions such as is.null(), is.nan(), is.infinite() (and its cousin is.finite()), and is.na() to help you figure out if particular items are of that constant type. At times this can be handy, such as when you have missing data and you want to set it to some meaningful value (e.g., X[is.na(X)] <- 32 will set all NA values in X to 32). We'll get into this more in depth at a later time.
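A minimal sketch of working with these constants follows. The data vector X is invented for the example; na.rm= is a standard parameter of mean():

```r
# A small vector of measurements with two missing values (made-up data).
X <- c(4.2, NA, 3.9, NA, 5.1)

# is.na() returns a logical vector marking the missing entries.
is.na(X)
nMissing <- sum(is.na(X))    # how many are missing
nMissing

# Many functions return NA unless told to skip the missing values.
mean(X)
mean(X, na.rm = TRUE)

# Replace the missing entries with a chosen value.
X[is.na(X)] <- 0
X

# Dividing by zero produces Inf or NaN rather than an error.
1 / 0
0 / 0
is.nan(0 / 0)
```

Whether replacing missing values with a number is sensible depends entirely on the analysis; often it is safer to leave them as NA and use options such as na.rm.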

2.4.5 Complex Numbers

Complex numbers are those that can be written in the form a + bi, where a is the real part and the product bi is the imaginary part, with i = √(−1). The code snippet below shows you how to create and query the class of a complex number.

> w <- complex(3)
> w
[1] 0+0i 0+0i 0+0i
> x <- complex(3,4,5)
> y <- 4+5i
> x
[1] 4+5i 4+5i 4+5i
> y
[1] 4+5i
> is.complex(x)
[1] TRUE
> is.complex(y)
[1] TRUE

The main difference between the constructor complex() and the other ones we have seen so far is that it can take initial values. For example, when called as complex(3), it returns three complex numbers whose real and imaginary parts are set to zero. However, calling the function as complex(3,4,5) makes three complex numbers, each with a four assigned to the real part and a five to the imaginary part. As shown, you can also create complex numbers by simply typing them directly on the command line as a+bi, which is probably the easiest way to do it.
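Once a complex number exists, base R provides functions for taking it apart. The sketch below uses Re(), Im(), Mod(), and Conj(), none of which appear in the listing above, so treat it as a supplementary illustration:

```r
y <- 4 + 5i

Re(y)     # real part
Im(y)     # imaginary part
Mod(y)    # modulus, sqrt(4^2 + 5^2)
Conj(y)   # complex conjugate

# sqrt() of a negative numeric gives NaN (with a warning),
# but it works once the value is coerced to complex.
root <- sqrt(as.complex(-1))
root
```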

2.4.6 Raw

The raw data type is a hexadecimal data type bounded on the inclusive range [0, 255]. Raw numbers are represented as a two-digit sequence of hex digits. Valid hex digits include 0-9 as well as a, b, c, d, e, and f. The listing below gives you some examples of how to create some raw data types.

> raw(3)
[1] 00 00 00
> as.raw(255)
[1] ff
> as.raw(13)
[1] 0d
> as.raw(256)
[1] 00
Warning message:
out-of-range values treated as 0 in coercion to raw
> is.raw(13)
[1] FALSE
> is.raw(0d)
Error: unexpected symbol in "is.raw(0d"
> x <- 0d
Error: unexpected symbol in "x <- 0d"

There are several important points to make here.

1. If you try to create a raw number outside its allowable range, R will issue you a warning and then assign the variable the default value, 00.

2. The digits 13, while valid raw digits, are not considered raw when given by themselves. This is because all numbers are considered numeric data types (see 2.4.2) by R. Even in the case of 0d, which is definitely a raw hex number, R doesn't coerce it into a raw type but leaves it as the characters 0d and then chokes on it. This is probably good behavior.

3. Similar to what was shown for integers, raw numbers must be created with the constructor raw() or the casting function as.raw() and cannot be created directly by simply pairing up valid digits.
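As a small supplementary sketch, as.raw() also accepts a whole vector, and the result can be converted back to ordinary integers. The base R function charToRaw() is not mentioned above; it is included only to show where raw values tend to turn up in practice:

```r
# Coerce several numbers at once; each becomes a two-digit hex value.
bytes <- as.raw(c(0, 13, 255))
bytes
class(bytes)

# Convert back to ordinary integers.
as.integer(bytes)

# charToRaw() shows the byte value behind a character string.
charToRaw("R")
```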

2.4.7 Logical

Logical data types are boolean variables with a value of TRUE or FALSE. Obviously, these two values are the opposites of each other (e.g., not TRUE is FALSE, etc.). You will encounter logical data types in two primary situations: (1) when you are writing a conditional statement that requires you to know the truth about something (e.g., if x == 0 you probably shouldn't try to divide by x because for some reason mathematicians haven't figured out how to divide by zero yet...), or (2) if you are trying to select some subset of your data by using a particular condition (e.g., select all entries where color == "blue").

The interesting thing about logical variables is that numbers can be coerced into a logical variable. For example, the number zero, as an integer, numeric, complex, or raw data type, is considered to be FALSE whereas any non-zero value is considered TRUE.
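The two situations described above, plus the coercion rule, can be sketched in a few lines. The color vector and the variable names are invented for the example:

```r
# Zero is FALSE; any non-zero number is TRUE.
as.logical(c(0, 1, -3, 0.5))

# Going the other way, TRUE counts as 1 and FALSE as 0,
# so sum() counts how many entries satisfy a condition.
color <- c("blue", "red", "blue", "green")   # made-up data
isBlue <- color == "blue"
isBlue
sum(isBlue)

# A conditional statement guarding against division by zero.
x <- 0
if (x == 0) {
  result <- NA
} else {
  result <- 1 / x
}
result
```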

2.4.8 Vectors

R is a vector language, and as you learn more and more of it you will appreciate the fact that you can work with vectors of numbers as easily as with single ones. In fact, I suppose it is probably better to think of a single number as a vector of length 1, which is why the R command line interface puts the [1] in front of every answer...

A vector is a sequence of items that can be created using the function vector(). However, since a vector is simply a sequence, it can be a sequence of any type of data. For example, I may have a vector of integers or a vector of complex numbers, or whatever. To specify the data type for a vector, you must tell vector() what type to use. Here is an example using the "numeric" data type.

> x <- vector("numeric", 3)
> x
[1] 0 0 0
> is.vector(x)
[1] TRUE
> is.numeric(x)
[1] TRUE

Notice that it assigns default values for each entry, as would be expected. However, it is also important to notice that not only is x a vector but it is also numeric! So in actuality, in all the preceding cases where we used a constructor to create a new data type, we were also creating vectors! Blows your mind, doesn't it! This is why it is safe to consider R a vector language.

Because you will use vectors so much, there is an easier way to create them using the c() function (c for combine). This is a short-hand version, and R tries to determine the type of the values that you pass to the c() function in order to do the right thing. Here are some examples:

> x <- c(1,2,3)
> x
[1] 1 2 3
> y <- c(TRUE,TRUE,FALSE)
> y
[1]  TRUE  TRUE FALSE
> z <- c("I","am","not","a","looser")
> z
[1] "I"      "am"     "not"    "a"      "looser"
> notGoingToWork <- c(00,0b,ff)
Error: unexpected symbol in "notGoingToWork <- c(00,0b"

The only caveat here is that if the data type cannot be determined unambiguously, then R will choke and tell you so, as shown in the last example where I was trying to make a vector of raw data types. For cases such as these, use the normal data type constructor (e.g., raw(3)) and then assign values to each element.

To access an element in a vector, R uses square brackets ([]) as demonstrated here:

> x <- vector("numeric",3)
> x
[1] 0 0 0
> x[1] <- 2
> x[3] <- 1
> x
[1] 2 0 1
> x[2]
[1] 0

Since working with vectors is such a common thing, there are a number of helper functions that you can use to make vectors.

> x <- 1:6
> x
[1] 1 2 3 4 5 6
> y <- seq(1,6)
> y
[1] 1 2 3 4 5 6
> z <- seq(1,20,by=2)
> z
 [1]  1  3  5  7  9 11 13 15 17 19
> rep(6,4)
[1] 6 6 6 6

The notation x:y provides a vector of whole numbers from x to y. In a similar fashion, the function seq(x,y,by=z) provides a sequence of numbers from x to y, with the optional parameter by= determining the step between values (in this case counting by 2s gives all the odd numbers from 1 to 20). The function rep(x,y) repeats x a total of y times. These are some real time-saving options and you will probably be using them often.
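The helper functions have a few more options worth knowing, and arithmetic on a vector is applied to every element at once. The sketch below uses the documented length.out=, times=, and each= parameters; the variable names are arbitrary:

```r
# seq() can be told how many values to produce instead of the step size.
quarters <- seq(0, 1, length.out = 5)
quarters

# rep() can repeat a whole vector, or each element in turn.
rep(c(1, 2), times = 3)
rep(c(1, 2), each = 3)

# Arithmetic is applied to every element of a vector at once.
z <- seq(1, 20, by = 2)
z * 2
length(z)

# A vector of indices pulls out several elements together.
z[c(1, 3, 5)]
```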

2.4.9 Matrices

Matrices are 2-dimensional vectors and can be created using the default constructor matrix() function. However, since they have two dimensions, you must tell R the size of the matrix that you are interested in creating by passing it a number for nrow and ncol, the number of rows and columns.

> matrix(nrow=2,ncol=2)
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
> matrix(23,nrow=2,ncol=2)
     [,1] [,2]
[1,]   23   23
[2,]   23   23

If you do not give matrix() a default value to put in each cell, it will fill them with NA, which is the way R indicates a missing value.

Matrices can be created from vectors as well.

> x <- c(1,2,3,4)
> x
[1] 1 2 3 4
> is.vector(x)
[1] TRUE
> is.matrix(x)
[1] FALSE
> matrix(x)
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
> y <- matrix(x,nrow=2)
> y
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> is.matrix(y)
[1] TRUE
> is.vector(y)
[1] FALSE

By default, if you do not provide any dimension to the matrix() function, it will produce a matrix with a single column of data. If you provide one of the dimensions then it will try to determine the size of the other dimension by looking at the length of the vector that you passed (e.g., here nrow=2 was given and it figured out that it should have two columns as well).

There is a slight gotcha here if you are not careful.

> x <- 1:4
> matrix(x,nrow=4,ncol=2)
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    4
> matrix(x,nrow=3)
     [,1] [,2]
[1,]    1    4
[2,]    2    1
[3,]    3    2
Warning message:
In matrix(x, nrow = 3) :
  data length [4] is not a sub-multiple or multiple of the number of rows [3]
> matrix(seq(1,8), nrow=4)
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

Notice here that R added the values of x to the matrix until it got to the end of x. Since that did not fill the matrix, it started over again at the beginning of x. In the first case the size of the matrix was a multiple of the length of x, whereas in the second case it wasn't, but R still assigned the values (and gave a warning). Finally, as shown in the last case, if the vector has exactly as many values as the matrix has cells, it fills up the matrix in a column-wise fashion.
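Column-wise filling is only the default. As a minimal sketch, the byrow= parameter of matrix() fills across the rows instead, and dim() reports the resulting size (the name M is arbitrary):

```r
# Default: values run down the columns.
matrix(1:6, nrow = 2)

# byrow = TRUE fills across the rows instead.
M <- matrix(1:6, nrow = 2, byrow = TRUE)
M

# dim() reports the number of rows and columns.
dim(M)
```

This matters when typing in a table by hand, where reading across the rows is usually the more natural order.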

To access values in a matrix you use the square brackets just as was done for the vector types. However, for matrices, you have to use two indices rather than one.

> X <- matrix(c(1,2,3,4,5,6),nrow=2)
> X
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> X[1,3]
[1] 5
> X[2,2] <- 3.2
> X
     [,1] [,2] [,3]
[1,]    1  3.0    5
[2,]    2  3.2    6
> X[1,]
[1] 1 3 5
> X[,3]
[1] 5 6

We will use matrices quite a bit but will delay the commentary on matrix algebra and operations until Chapter 8. However, the last two operations provide a hint as to some of the power associated with manipulating matrices. These are slice operations, where giving only one index (e.g., X[1,]) returns the entire row or column as a vector.
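Slices can also take ranges and negative indices. The sketch below is a supplement to the listing above; drop= is a documented argument of the bracket operator, and it controls whether a one-row or one-column result collapses to a plain vector:

```r
X <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

# Several columns at once: the result is still a matrix.
X[, 2:3]

# A single row normally drops down to a plain vector ...
is.matrix(X[1, ])

# ... unless drop = FALSE is used to keep the matrix shape.
is.matrix(X[1, , drop = FALSE])

# A negative index leaves a row or column out.
X[, -1]
```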

2.4.10 Factors

Factors are a particular kind of data used in statistics and sampling. You can think of a factor as a categorical treatment type that you are using in your experiments (e.g., Male vs. Female or Treatment A vs. Treatment B vs. Treatment C). Factors can be ordered or unordered depending upon how you are setting up your experiment.

Most factors are given as characters so that naming isn't a problem. Below is an example of five observations where the categorical variable sex of the organism is recorded.

> sex <- factor(c("Male","Male","Female","Female","Unknown"))
> levels(sex)
[1] "Female"  "Male"    "Unknown"
> table(sex)
sex
 Female    Male Unknown
      2       2       1
> sex[5] <- "Male"
> sex
[1] Male   Male   Female Female Male
Levels: Female Male Unknown

Here the table() function takes the vector of factors and makes a summary table from it. Also notice that the levels() function tells us that there is still an "Unknown" level for the variable even though there is no longer a sample that has been classified as "Unknown" (it just currently has zero of them in the data set).
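Two follow-up points can be sketched briefly: how to get rid of a level that is no longer in use, and what an ordered factor looks like. The dose categories are invented for this example; levels= and ordered= are documented parameters of factor():

```r
sex <- factor(c("Male", "Male", "Female", "Female", "Unknown"))
sex[5] <- "Male"

# The unused level is still there, with a count of zero.
table(sex)

# Re-creating the factor keeps only the levels that are in use.
sex <- factor(sex)
levels(sex)

# An ordered factor, with the order of the levels given explicitly.
dose <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"), ordered = TRUE)
dose
dose < "high"
```

Without the levels= argument the levels would be sorted alphabetically (high, low, medium), which is rarely the order you want for an ordered factor.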

2.4.11 Lists

A list is a convenience data type whose function is to group other data items.

> theList <- list(x=seq(2,30), dog=LETTERS[1:5], hasStyle=logical(5))
> summary(theList)
         Length Class  Mode
x        29     -none- numeric
dog       5     -none- character
hasStyle  5     -none- logical
> theList
$x
 [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[26] 27 28 29 30

$dog
[1] "A" "B" "C" "D" "E"

$hasStyle
[1] FALSE FALSE FALSE FALSE FALSE

> theList$x
 [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[26] 27 28 29 30
> theList$x[2]
[1] 3
> theList$x[2] <- 22
> theList$x
 [1]  2 22  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[26] 27 28 29 30
> theList$dog[2]
[1] "B"
> theList$MyFavoriteNumber <- 2.9 + 3i
> theList$MyFavoriteNumber
[1] 2.9+3i

As you can see, a list can contain a range of different types of data. The summary() function gives, not too surprisingly, a summary of the items within the list. These data are grouped together by the list, but you can access them and manipulate them just as you would if they were stand-alone variables, with the exception that you must use the list name and the dollar sign. R uses the dollar sign $ frequently to designate something that is contained within something else. You will find when you conduct analyses and assign the results to a variable that the variable will be a list, and to access predicted values, error terms, or other components of that analysis you will do so by using the $ nomenclature.

It is important to remember that lists are general groupings of variables, and these variables do not necessarily have any relationship to one another other than my need to group them because it makes sense to me to do so. This is different from what is found in the next data type, the data frame.
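As a preview of the point about analysis results being lists, here is a minimal sketch using lm(), R's standard linear-model function, which has not been introduced yet at this point in the text. The data are made up and the names height, growth, and fit are arbitrary:

```r
# Made-up data: a predictor and a response.
height <- c(1.2, 1.9, 3.1, 4.0, 5.2)
growth <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit a simple linear regression and keep the result.
fit <- lm(growth ~ height)

# The result behaves like a list; names() shows what it contains.
is.list(fit)
names(fit)

# Pull out individual components with the $ notation.
fit$coefficients
fit$residuals
```

The details of the model do not matter here; the point is only that the pieces of the result are reached with the same $ notation used for theList above.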

2.4.12 Data Frames

Data frames are kind of like lists in that they can have named items within them; however, it is easiest for me to think of a data frame as a spreadsheet. It has rows of items, and each row has one or more columns. As in a spreadsheet, each column has a variable name (say height or NumberOfBumps). There is an inherent relationship between the values that share a row, in that together they make up an observation of some sort. This is the distinction between data frames and lists: the ith row of a data frame can be considered a single observation across all columns of variables.

Typically, when I load data into R from an external source, I do so by creating a data frame. There are other ways to load data but I find this to be the most convenient. The topic of data frames is large enough that I will delay discussion of it until Chapter 3, when we discuss it in depth and provide some analogies to how a data frame is like a database.
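Ahead of the fuller treatment in Chapter 3, a data frame can also be built by hand with the data.frame() function. The sketch below uses invented values; the column names height and NumberOfBumps follow the examples in the text, and site is added only for illustration:

```r
# Build a small data frame by hand (the values are invented).
plants <- data.frame(height = c(10.2, 8.7, 12.1),
                     NumberOfBumps = c(3, 0, 5),
                     site = c("A", "A", "B"))
plants

# Each column is reached by name with $, just as in a list.
plants$height

# Each row is one observation across all of the columns.
plants[2, ]

# dim() gives the number of rows and columns.
dim(plants)
```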

2.5 Operators

R recognizes the proper order of operations for mathematical expressions. As in normal notation, you can override the normal order of operations by using parentheses in appropriate places. What follows is a brief discussion of some basic kinds of operators.
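To make the point about order of operations concrete, here is a short sketch; the variable names a through d are arbitrary and exist only so the results can be inspected.

```r
a <- 2 + 3 * 4      # multiplication happens before addition: 14
b <- (2 + 3) * 4    # parentheses force the addition first: 20
c2 <- 2 * 3^2       # the exponent is evaluated before multiplication: 18
d <- -2^2           # the exponent binds tighter than the leading minus: -4

a
b
c2
d
```

The last line is a common surprise; write (-2)^2 if you want the negative number squared.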

2.5.1 Assignment Operators

As described above, assignments are made using the assignment operator <-, and values can actually be assigned the other way with the operator ->. Examples of assignments include:

> x <- 23
> 56 -> y
> x
[1] 23
> y
[1] 56

Again, it is important to note that (a) under assignment, nothing is printed out from the R engine, and (b) to see the value of a variable, just type its name on the command line.

2.5.2 Numerical Operators

Numerical operators are defined as operations on variables. These include the normal set of operators: addition (+), subtraction (-), multiplication (*), division (/), and exponents (^). Examples of these operators are:

> x*2
[1] 46
> y-5
[1] 51
> x-y
[1] -33
> x^2
[1] 529
> x/y
[1] 0.4107143
> x
[1] 23
> y
[1] 56

Notice here that these expressions did not change the values of the variables because there was no assignment involved.

2.5.3 Logical Operators

Oftentimes we need to run comparisons between variables. These operators determine the truth of a statement and return a boolean (i.e., TRUE or FALSE). Operators include equality (==; notice this is two equals signs), explicit relations (< and >), range relations (>= for greater than or equal to and <= for less than or equal to), and inequality (!=).

> x <- 23
> y <- 56
> x==y
[1] FALSE
> x<y
[1] TRUE
> x>y
[1] FALSE
> x>=y
[1] FALSE
> x!=y
[1] TRUE
> y<=x
[1] FALSE

These operators are commonly found in conditions but can also be used to select a subset of values from a data vector (see ??).
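As a brief sketch of that last use, a comparison applied to a whole vector returns one TRUE or FALSE per element, and that logical vector can then be placed inside square brackets to keep only the matching values. The vector heights and its values are made up for illustration.

```r
heights <- c(23.4, 32.9, 29.7, 38.2, 32.7)   # made-up values

# The comparison is applied to every element in turn
isTall <- heights > 30
isTall

# Indexing with the logical vector keeps the entries marked TRUE
tall <- heights[isTall]
tall
```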

2.6 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• class(x) This function will return the kind of variable that x is. We have been using it all along in the discussions of data types but you will probably not use it very much.

• dim(x) This function returns the dimensions of x, that is, the number of rows and columns in x, which is appropriate for matrices and data frames. The result is returned as a vector of length 2 with the number of rows in the first index and the number of columns in the second. This function will return NULL for data types that have no dimensions, such as plain vectors and lists.

• length(x) This will return the length of x, which means different things depending upon the kind of variable that x is.

– If x is an integer, numeric, logical, character, complex, or raw, this function will return the number of items in x. Essentially, this will tell you if you have a single data point or a vector of data points. Remember that the default constructors of these data types allow you to make a vector of items, so R treats all these data types as vectors and returns the length of the vector.

– If x is a list or a data frame then it will return the number of variables in that list or data frame. For example, assume that theList is defined as in 2.4.11; then length(theList) would return the number 3.

– For matrices the function returns the number of elements in the matrix. So a matrix with 3 rows and 2 columns would have a length of 6.

• paste(x,y) This function concatenates items into a character string. By default, this function puts a space between the items in x and y, although you can change this behavior by setting a value for the optional sep parameter passed to the function.

• rep(x,n) This function repeats the value x a total of n times and returns it as a vector.

• seq(f,t,by=b) This function returns a sequence of numeric types from f to t by b.

• summary(x) This function will return an overview of the variable x.

– If x contains numerical values then it will provide the following quantitative measures: the Minimum, the 1st Quartile, the Median, the Mean, the 3rd Quartile, and the Maximum.

– If x is a list or data frame then it provides a summary of each variable in x.
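The functions listed above can be tried together in a few lines. This is only a sketch; the object names (m, v, spaced, dashed, s) are arbitrary choices for the example.

```r
# A matrix with 3 rows and 2 columns
m <- matrix(1:6, nrow = 3, ncol = 2)
class(m)
dim(m)       # rows first, then columns
length(m)    # total number of elements

# A plain vector has a length but no dimensions
v <- rep("a", 4)
length(v)
dim(v)       # NULL

# paste() separates items with a space unless sep says otherwise
spaced <- paste("Pop", "A")
dashed <- paste("Pop", "A", sep = "-")

# A sequence from 1 to 2 in steps of 0.25, and its summary
s <- seq(1, 2, by = 0.25)
summary(s)

# The help system documents each of these, e.g. ?seq or help("paste")
```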

2.7 Exercises

The following exercises are meant to help you understand the items presented in this chapter.

1. Create two variables, x and y, of type integer using the as.integer function and assign them the values of 4 and 5. Add x and y and store the result in a third variable named z. What kind of variable is z?

2. Create two variables, x and y, of type integer using the as.integer function and assign them the values of 3 and 2. Divide x by y and store the result in a third variable named z. What kind of variable is z? Why is this different from the answer in the previous question?

3. Coerce x <- 23 into other data types, using the as.* functions for each data type, to see which are amenable.

4. What numeric values are considered TRUE when coerced into a logical data type using the function as.logical()?

5. Create a sequence of numbers ranging from 1 to 10 by 0.1 and assign it to the variable x.

6. Create a sequence of numbers from 100 down to 50 by 2 and assign it to the variable y.

7. Turn the vector of character items "Control", "Control", "Control", "Ear Removal", "Ear Removal", "Ear Removal", "Ear Removal", "Fake Ear Removal", "Fake Ear Removal", "Fake Ear Removal", "Fake Ear Removal" into a Factor variable and create a table from it to show the number of entries in each treatment.

8. Create a vector of character variables that contains 25 "a", 15 "b", and 58 "c" instances. What is the length of this vector? Create a table from the entries.

9. Create a variable that is a list. In the list add variables for your name, email address,and height.

10. How is a data frame different than a list?

Part II

Biologically Motivated Topics

Chapter 3

Data Frames

In this chapter we will be learning about data frames and how we can use them to our benefit. Data frames are useful as they are a single object within which we can store data (to disk or databases), perform statistical analyses, and perform complicated selections.

In my interactions with R, I spend the vast majority of my time working with data that is contained within a data frame. This is because I typically keep my data in either spreadsheets or databases, both of which force me to coerce my observations into something like:

Population,Height,Sex
A,23.4,Female
A,32.9,Female
A,29.7,Female
A,38.2,Male
A,32.7,Male
B,28.4,Female
B,27.3,Male
B,27.7,Male
B,30.1,Female

This format is relatively rigid but is amenable to several types of observations. The first row is a header row with the name of each variable spelled out. The second and all subsequent rows are observations with a value for each column of data. Columns of data are also separated by some kind of delimiter. Here I am using a comma (and the file is probably saved as a csv file) but tabs, spaces, and other characters can also be used. For the rest of this chapter, we will use the data above as an example to show how to interact with and manipulate data frames.

In this chapter you will learn the following skills:

• Enter data into a data frame.

• Load a data frame from an existing file.

• Save a data frame to a file.

• Manipulate data within a data frame.

• Perform complex queries and joins on data frames.

3.1 Data Input/Output

Data can be input into data frames in two different ways: you can enter it directly or load it from an external file. The former method is good if you have just a little bit of data, whereas the latter is probably better if you have persistent data.

3.1.1 Entering Data Directly

The discussion of the different data types in Chapter 2 should be all you need to understand how to put data in manually. To recreate the example data set you could:

> Pop <- c("A","A","A","A","A","B","B","B","B")
> Ht <- c(23.4,32.9,29.7,38.2,32.7,28.4,27.3,27.7,30.1)
> Sx <- c("Female","Female","Female","Male","Male","Female","Male","Male","Female")

Once you have these variables entered into R, you can put them into a single data frame by:

> myData <- data.frame(Population=Pop, Height=Ht, Sex=Sx)
> myData
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male
9          B   30.1 Female
> summary(myData)
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20

Notice how the data are already numbered by observation. The names that you pass to the data.frame() function will be the names of the variables in the data frame, and the names of the variables you previously defined for them will be thrown away (e.g., there is not a variable named Pop in myData).

Once you have created a data frame, you can access elements within it as you would for a list (and even as a matrix to some extent).
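One caveat if you try the listing above in a recent version of R: since R 4.0.0, data.frame() no longer converts character vectors to factors by default, so summary() describes Population and Sex as character columns instead of counting the entries in each level. The sketch below requests factors explicitly with stringsAsFactors = TRUE to reproduce the behavior shown in this section, and then accesses elements list-style and matrix-style.

```r
Pop <- c("A","A","A","A","A","B","B","B","B")
Ht  <- c(23.4,32.9,29.7,38.2,32.7,28.4,27.3,27.7,30.1)
Sx  <- c("Female","Female","Female","Male","Male","Female","Male","Male","Female")

# stringsAsFactors = TRUE turns the character columns into factors
myData <- data.frame(Population = Pop, Height = Ht, Sex = Sx,
                     stringsAsFactors = TRUE)
summary(myData)

# List-style access, by variable name
myData$Height
myData[["Sex"]]

# Matrix-style access: row 4, column 2
myData[4, 2]
```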

3.1.2 Loading Data From A File

It is relatively common for you to already have data on hand, and it is a bit of a waste of time to re-enter the data into R (doing so would also carry a high probability of errors as you type the values in). Getting data into R is pretty easy.

The format of the data file is relatively important. There are methods available to import normal Excel files into R, but I will not go into them because the file format for this program changes with each release and it is not portable across platforms (e.g., there is no Excel on unix). Moreover, there are a lot of other places that you can get data, such as online databases, data loggers, etc., so a more general approach will be followed here.

I will assume that you can get your data file into a text format. What matters for the import are the following items:

1. Does the data have a row of variable names (headers) in the first row? If you do not have a row of headers then R will assign them as V1, V2, ....

2. What character do you use to separate columns of data? Is it tab, space, comma, or some other character that separates your data columns?

3. Do you have any items that are in quotes? Some programs will output text wrapped in quotes. This is not that common but you should be aware of it.

4. You need to either have the data file in the same directory that you were working in when you started R or know the full path to the file (e.g., /Desktop/data.txt or C:Whatever).
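For the last item in the list above, R can tell you which directory it is working in and whether it can see a given file. The sketch below uses the file name DataFrame1.txt from this chapter; whether that file actually exists on your machine is an assumption, so the result of file.exists() may well be FALSE.

```r
# The directory R is currently working in
getwd()

# The file name used in this chapter; it may not exist on your machine
dataFile <- "DataFrame1.txt"

# TRUE only if the file can be found from the working directory
file.exists(dataFile)

# Build a full path from the working directory and the file name
fullPath <- file.path(getwd(), dataFile)
fullPath
```

If file.exists() returns FALSE, either move the file or hand the full path to the import function.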

It is important for you to realize that the data you enter into a data frame have to have the same number of data columns for every observation. In the example data file above, there are three values in each row. If you do not have the same number of values in each row, R will barf up some errors. Be careful here: sometimes when you export from a particular spreadsheet program (that shall remain nameless) you can get extra columns of data that will screw up your import. If you get some odd errors, you may want to open the text file in a text editor to take a look. If you forget to add one of the additional options to the read.table() function, R may actually load the file, but it won't be as you expect. For example, in the example below, where I did not tell R that the data file uses a comma as a column separator, it loads every row as a single text observation (and considers it a factor) rather than three columns of data.

> data <- read.table("DataFrame1.txt", header=T)
> data
  Population.Height.Sex
1         A,23.4,Female
2         A,32.9,Female
3         A,29.7,Female
4           A,38.2,Male
5           A,32.7,Male
6         B,28.4,Female
7           B,27.3,Male
8           B,27.7,Male
9         B,30.1,Female
> data[1,]
[1] A,23.4,Female
9 Levels: A,23.4,Female A,29.7,Female A,32.7,Male ... B,30.1,Female
> data <- read.table("DataFrame1.txt", header=TRUE, sep=",")
> data
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male
9          B   30.1 Female
> summary(data)
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20

The options passed to read.table() are the file name (with path if necessary), a header parameter (TRUE or FALSE) to indicate whether the data file has a header row, and sep to indicate what character is used as a separator. Other separators are tab (indicated as sep="\t") and whitespace (sep="", the default, which treats any run of spaces or tabs as a separator). Barring any errors that I made in typing in the data in the last section (3.1.1), the printed data frame should be identical.
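If you do not have DataFrame1.txt at hand, the effect of the sep option can be reproduced with a temporary file. In this sketch the file is created by tempfile() and holds only a few rows of the example data; the names csvFile, wrong, and right are invented for the illustration.

```r
# Write a small comma-separated file to a temporary location
csvFile <- tempfile(fileext = ".txt")
writeLines(c("Population,Height,Sex",
             "A,23.4,Female",
             "A,38.2,Male",
             "B,27.3,Male"), csvFile)

# Without sep, each line is read as one piece of text
wrong <- read.table(csvFile, header = TRUE)

# With sep = ",", the three columns are recovered
right <- read.table(csvFile, header = TRUE, sep = ",")

ncol(wrong)
ncol(right)
str(right)

# Clean up the temporary file
unlink(csvFile)
```

Checking ncol() or str() right after an import is a cheap way to catch a wrong separator before it causes trouble later.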

3.1.3 Adding Data To An Existing Data Frame

Once you have a data frame in R, you can add data to it relatively easily. To add additional rows of data you use the function rbind() (as in row bind). What you add to the data frame must be another list or data frame that has the same variables in it as your original data frame. If the thing you are adding does not have all the variables, R will give you an error. Here is an example.

> rbind(data, data.frame(Population="B", Height=31.3, Sex="Female"))
   Population Height    Sex
1           A   23.4 Female
2           A   32.9 Female
3           A   29.7 Female
4           A   38.2   Male
5           A   32.7   Male
6           B   28.4 Female
7           B   27.3   Male
8           B   27.7   Male
9           B   30.1 Female
10          B   31.3 Female
> data
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male
9          B   30.1 Female

Notice that the added observation (B 31.3 Female) was not retained in the data object. That is because this function does not change the data frame that is passed to it; rather, it returns a brand new data frame that is identical to the original one but has the additional data appended on the bottom. If you want to permanently change your existing data frame then you need to use the assignment operator as:

> data <- rbind(data, list(Population="A", Height=32, Sex="Male"))
> data
   Population Height    Sex
1           A   23.4 Female
2           A   32.9 Female
3           A   29.7 Female
4           A   38.2   Male
5           A   32.7   Male
6           B   28.4 Female
7           B   27.3   Male
8           B   27.7   Male
9           B   30.1 Female
10          A   32.0   Male

To add additional columns of data you use the function cbind() (as in column bind). This amounts to adding another variable to all the observations in your current data set. Again, for this to work, you should provide as many items as there are rows of data in the data frame.

> cbind(data, list(SizeClass = c(1,1,1,2,2,1,2,2,1,2)))
   Population Height    Sex SizeClass
1           A   23.4 Female         1
2           A   32.9 Female         1
3           A   29.7 Female         1
4           A   38.2   Male         2
5           A   32.7   Male         2
6           B   28.4 Female         1
7           B   27.3   Male         2
8           B   27.7   Male         2
9           B   30.1 Female         1
10          A   32.0   Male         2

Again, if you want to make the additions to your data frame permanent then you need to use the assignment operator.

> data <- cbind(data, list(SizeClass = c(1,1,1,2,2,1,2,2,1,2)))
> data
   Population Height    Sex SizeClass
1           A   23.4 Female         1
2           A   32.9 Female         1
3           A   29.7 Female         1
4           A   38.2   Male         2
5           A   32.7   Male         2
6           B   28.4 Female         1
7           B   27.3   Male         2
8           B   27.7   Male         2
9           B   30.1 Female         1
10          A   32.0   Male         2

The reason that these two functions do not change the data frame that you passed to them is that you may want to make a temporary data frame with some additional variables, or a copy of the data frame.
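The return-a-new-object behavior of rbind() and cbind() can be checked on a small scale. The sketch below uses a two-row data frame named d (an invented name, with values borrowed from the example data) rather than the full data set.

```r
d <- data.frame(Population = c("A", "B"),
                Height     = c(23.4, 28.4),
                Sex        = c("Female", "Male"))

# rbind() returns a new data frame; d itself is untouched
bigger <- rbind(d, data.frame(Population = "B", Height = 31.3, Sex = "Female"))
nrow(d)
nrow(bigger)

# Assign the result back to d to make the new row permanent
d <- rbind(d, list(Population = "A", Height = 32, Sex = "Male"))

# cbind() needs one value for every row already in the data frame
d <- cbind(d, SizeClass = c(1, 1, 2))
d
```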

3.1.4 Copying Data Frames

To copy a data frame, use the assignment operator. This makes a new, independent copy of the data frame. For example, in the listing below, newData is made as a copy of data. Then the Population variable for the first row is changed from A to B. Notice how changes to newData are independent of entries in data.

> newData <- data
> newData[1,]
  Population Height    Sex SizeClass
1          A   23.4 Female         1
> newData[1,1] <- "B"
> newData[1,]
  Population Height    Sex SizeClass
1          B   23.4 Female         1
> data[1,]
  Population Height    Sex SizeClass
1          A   23.4 Female         1
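The same independence can be confirmed with a self-contained sketch. The three-row data frame here is a cut-down stand-in for the example data.

```r
data <- data.frame(Population = c("A", "A", "B"),
                   Height     = c(23.4, 32.9, 28.4))

# Assignment makes an independent copy
newData <- data

# Changing the copy leaves the original alone
newData[1, 1] <- "B"
newData[1, ]
data[1, ]
```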

3.1.5 Removing Data From A Data Frame

How you remove items from a data frame depends upon whether you are removing columns or rows of data. To remove a row of data (e.g., a whole set of variables for a single observation) you can use a negative sign in front of the index.

> data[-10,]
  Population Height    Sex SizeClass
1          A   23.4 Female         1
2          A   32.9 Female         1
3          A   29.7 Female         1
4          A   38.2   Male         2
5          A   32.7   Male         2
6          B   28.4 Female         1
7          B   27.3   Male         2
8          B   27.7   Male         2
9          B   30.1 Female         1
> data
   Population Height    Sex SizeClass
1           A   23.4 Female         1
2           A   32.9 Female         1
3           A   29.7 Female         1
4           A   38.2   Male         2
5           A   32.7   Male         2
6           B   28.4 Female         1
7           B   27.3   Male         2
8           B   27.7   Male         2
9           B   30.1 Female         1
10          A   32.0   Male         2

Again, this returns a data frame without the given index. If you want to make this permanent you must make an assignment as before. You can also pass a vector of indices to remove more than one row at a time (see also the function subset() in 3.3.1).

> data <- data[-10,]
> data
  Population Height    Sex SizeClass
1          A   23.4 Female         1
2          A   32.9 Female         1
3          A   29.7 Female         1
4          A   38.2   Male         2
5          A   32.7   Male         2
6          B   28.4 Female         1
7          B   27.3   Male         2
8          B   27.7   Male         2
9          B   30.1 Female         1
> data[-c(2,4,6,8),]
  Population Height    Sex SizeClass
1          A   23.4 Female         1
3          A   29.7 Female         1
5          A   32.7   Male         2
7          B   27.3   Male         2
9          B   30.1 Female         1

Deleting a column of data can be accomplished in the same manner or by assigning the variable the value of NULL.

> data <- data[,-4]
> data
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male
9          B   30.1 Female
> data$Sex <- NULL
> data$Population <- NULL
> data
  Height
1   23.4
2   32.9
3   29.7
4   38.2
5   32.7
6   28.4
7   27.3
8   27.7
9   30.1
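The removal idioms above can be collected into one short sketch. The data frame d is a four-row stand-in invented for this example.

```r
d <- data.frame(Population = c("A", "A", "B", "B"),
                Height     = c(23.4, 32.9, 28.4, 27.3),
                Sex        = c("Female", "Male", "Female", "Male"))

# A negative index returns everything except that row; d is unchanged
d[-4, ]

# Assign the result to make the removal permanent; here rows 1 and 3 go
d <- d[-c(1, 3), ]

# Remove a column by index ...
d <- d[, -3]

# ... or by name, by assigning NULL
d$Height <- NULL
d
```

One thing to watch: if a negative column index inside the square brackets leaves only a single column, R returns a plain vector rather than a data frame unless you add drop = FALSE, as in d[, -1, drop = FALSE]. Assigning NULL to a named column does not have this side effect.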

3.1.6 Saving Data Frames to Files

There comes a time when you have to save some data you have been working on. In fact, it comes quite often. There are several ways to save data in R. First, you can have R save every variable in memory. When you quit R using the q() function, it will ask if you want to save:

> q()
Save workspace image? [y/n/c]: y

If you do, there will be a .RData file saved in the directory you are working in that contains all the data you currently have in memory. When you restart R in that directory, it will load these data back into memory for you. This is a fairly easy and direct way of getting your data to disk and back, and it is cross-platform. If you are going to use this kind of data saving, you should create a new folder for any data set you are working with. This will keep the raw data file(s) in the same location as the data entered and formatted in R. The main drawback to this is that the name of the saved data file (.RData) starts with a period (.) and will therefore be invisible to you when you look in the folder with your normal Finder, File Browser, or whatever. You can easily overwrite it or throw it away since it isn't immediately visible. It is also a bit inefficient in that if you have a bunch of other variables in memory you may not want to save them all. If I just merged a bunch of data frames (see 3.3.2), I may only want to save the final data.

The second way that you can save your data frame is to save the data frame directly. This allows you to save different data frames with different names, and you can save them wherever, and named whatever, you like.

> save(data, file="MyNewSavedData.Rdata")

You can also save several variables at once by passing their names as a character vector to the list parameter of the save() function. Here is an example:

> g <- 1:20
> otherData <- factor(c(T,T,T,T,F,F))
> g
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> otherData
[1] TRUE  TRUE  TRUE  TRUE  FALSE FALSE
Levels: FALSE TRUE
> save(list=c("data","g","otherData"), file="DataType2.RData")

It is common for saved data from R to have the file suffix of .Rdata, so let's not buck tradition...

Once you have saved the data frame, you can load it back into memory at any time by:

> ls()
[1] "data"
> rm(data)
> ls()
character(0)
> load("MyNewSavedData.Rdata")
> ls()
[1] "data"

Notice here I use ls() to see what is in memory and rm() to remove data from memory (and check), then reload the data using the load() function.
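A save-and-reload round trip can be tried without cluttering your working directory by writing to R's temporary directory. In this sketch the file name Example.RData and the objects myData and g are made up for the illustration.

```r
myData <- data.frame(Height = c(23.4, 32.9))
g <- 1:20

# Write both objects to a file in the session's temporary directory
outFile <- file.path(tempdir(), "Example.RData")
save(myData, g, file = outFile)

# Remove them from memory ...
rm(myData, g)

# ... and bring them back; load() returns the names of what it restored
loaded <- load(outFile)
loaded
myData
```

Capturing the value returned by load() is a handy way to find out what a saved file contained.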

3.1.7 Deleting Data Frame

Removing a data frame from memory is no different than removing any other variable. You simply use the rm() function as:

> rm(data)

If you have a lot of different data files in memory, you can delete them individually, as a group, or delete everything in memory at once as shown below:

> ls()
[1] "elvis genotypes"          "kent hovinds secret data"
[3] "myCoolData"               "x"
[5] "y"                        "yourNotLooserData"
> rm("x")
> ls()
[1] "elvis genotypes"          "kent hovinds secret data"
[3] "myCoolData"               "y"
[5] "yourNotLooserData"
> rm(list=c("y","myCoolData"))
> ls()
[1] "elvis genotypes"          "kent hovinds secret data"
[3] "yourNotLooserData"
> rm(list = ls())
> ls()
character(0)

To delete an individual variable, you simply name it, but when you delete several variables at once you need to tell the rm() command that you are going to pass it a list of variable names to delete (the list= parameter). The final example shows how you can tell it to delete everything in memory (i.e., delete this list, where the list is the names of all the data currently in memory).
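Because rm(list = ls()) wipes your whole session, the sketch below demonstrates that idiom on a scratch environment instead of the workspace. The environment scratch and the variable names are invented for the example; in everyday use you would call rm(list = ls()) directly.

```r
x <- 1
y <- 2
z <- 3

# One variable, named directly
rm(x)

# Several variables, passed as a character vector to list=
rm(list = c("y", "z"))

# The "delete everything" idiom, tried on a scratch environment
scratch <- new.env()
assign("a", 1, envir = scratch)
assign("b", 2, envir = scratch)
ls(scratch)

rm(list = ls(scratch), envir = scratch)
ls(scratch)
```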

3.1.8 Components of a Data Frame

A data frame has a few distinct components in addition to the data points. The function attributes() shows the things that make up a data frame. This function returns a list containing the variable names, class, and row.names.

> attributes(data)
$names
[1] "Population" "Height"     "Sex"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4 5 6 7 8 9

> dataAttributes <- attributes(data)
> dataAttributes$row.names
[1] 1 2 3 4 5 6 7 8 9

There are also other ways to access these attributes. In Chapter 2, you were introduced to the class(x) function and we will not need to go over that again here. There are corresponding functions names(x) and row.names(x) that you can easily use to get access to these components of a data frame. You can also use these functions to assign new values to an existing data set. For example:

> data
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male
9          B   30.1 Female
> names(data) <- c("Group","DistanceFromGround","Gender")
> data
  Group DistanceFromGround Gender
1     A               23.4 Female
2     A               32.9 Female
3     A               29.7 Female
4     A               38.2   Male
5     A               32.7   Male
6     B               28.4 Female
7     B               27.3   Male
8     B               27.7   Male
9     B               30.1 Female
> row.names(data) <- seq(9,1,by=-1)
> data
  Group DistanceFromGround Gender
9     A               23.4 Female
8     A               32.9 Female
7     A               29.7 Female
6     A               38.2   Male
5     A               32.7   Male
4     B               28.4 Female
3     B               27.3   Male
2     B               27.7   Male
1     B               30.1 Female
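Both names() and row.names() can also be used to change a single entry or to give rows meaningful labels. This sketch uses a two-row data frame d and the labels "first" and "second", all of which are invented for the illustration.

```r
d <- data.frame(Population = c("A", "B"),
                Height     = c(23.4, 28.4))

# Inspect the column names, then rename just the second one
names(d)
names(d)[2] <- "DistanceFromGround"

# Row names do not have to be numbers
row.names(d) <- c("first", "second")

# Rows can then be selected by their label
d["second", ]

attributes(d)
```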

3.2 Slicing

Grabbing portions of your data frame is pretty easy. Below are some examples of how you can access some of your data components:

> data[,1]
[1] A A A A A B B B B
Levels: A B
> data[,2]
[1] 23.4 32.9 29.7 38.2 32.7 28.4 27.3 27.7 30.1
> data[1:4,]
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
> data$Sex
[1] Female Female Female Male   Male   Female Male   Male   Female
Levels: Female Male
> data$Population
[1] A A A A A B B B B
Levels: A B

Here are some rules that you need to keep in mind:

1. To access a data frame's items by index, you use the square brackets [] along with the indices of the components separated by a comma.

2. R orders the indices of its two-dimensional data types row first. That is to say that the first index is for the row and the second index is for the column. For example data[1,2] will provide access to the 1st row and the 2nd column.

3. To get all the items in a given row or column you can leave out the other index. For example, the command data[i,] returns all columns of data from the ith row whereas data[,j] returns the data in all rows for the jth column.

4. You can also index the data for a particular column by calling its name. For example, the example data set has variables named Population, Height, and Sex. You can get all the data in one of these variables by using the notation data$VariableName as in data$Population.

5. To get a range of values on one or the other index, such as the 2nd through 5th entries in the height variable, you put the range of indices separated by a colon as in data[2:5,2]. You can also combine this with the naming of the variables, which may make it a bit easier to read, as data$Height[2:5]. This can work in both directions, as shown above when retrieving all the data for the first four records (data[1:4,]).
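The rules above can be checked against one another in a short sketch. The data frame d holds the first five rows of the example data; the result names (oneValue, oneRow, and so on) are invented for the illustration.

```r
d <- data.frame(Population = c("A", "A", "A", "A", "A"),
                Height     = c(23.4, 32.9, 29.7, 38.2, 32.7),
                Sex        = c("Female", "Female", "Female", "Male", "Male"))

oneValue  <- d[1, 2]    # row 1, column 2
oneRow    <- d[2, ]     # every column of row 2
oneColumn <- d[, 2]     # every row of column 2

# Two ways of asking for entries 2 through 4 of Height
byIndex <- d[2:4, 2]
byName  <- d$Height[2:4]

oneValue
oneRow
byIndex
byName
```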

3.3 Complex Selections

R data frames can be thought of as pseudo databases. If you ever interact directly with a database, you use the Structured Query Language (SQL) to interact with the data; it is a standard language that has been adopted by both the American National Standards Institute (ANSI) and, later, the International Organization for Standardization (ISO). R does allow you to interact with databases through one of its many database libraries but I will not be covering that in this chapter. However, if you are familiar with some basic SQL operations you will find this section rather easy. If not, I will be spending a little extra time trying to convince you that it is probably in your best interest to understand how to query your data frames, because it gives you a lot of power and flexibility, and showing you how to use a data frame as a lite database. After all, being agile with your data is a key skill I hope you will be learning in this course. Even if you never use a database, this section is really important as it will allow you to think about interacting with your data in interesting and complex ways.

To understand SQL you need to understand that in a database, data is contained within tables. Tables have rows and columns of data, just like a data frame. Each table also has a name. You can think of a database table as a worksheet in a spreadsheet program if that helps (though real database gurus are probably cringing as they read that). The SQL language is very easy to understand, and I will partition this section into commands that query the database and those that create new data frames by combining two or more existing data frames that have a common data column.

3.3.1 Queries

Queries are essentially what we have been doing in 3.2 with indices, so I won't go over the basic stuff that we have already covered other than to show the SQL equivalents in case you need to know them. I will, however, delve a bit into how the function subset() works because it is pretty powerful.

To select all observations in SQL, you use the statement SELECT * FROM tableName, which in R is simply what we have been doing by typing the name of the data frame (hereafter I will use data to refer to the name of the table for similarity with our previously loaded data frame).

> data
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male
9          B   30.1 Female

In these SQL statements I use words in all capital letters to indicate SQL language components and lowercase words to indicate table names or variables. Also, in SQL the asterisk means "everything" (as in all variables).

The strength of SQL and databases lies in the fact that you can do complicated selections from the tables. For example, in SQL you can select by row number and column number using the statement SELECT * FROM data WHERE rownum==x AND colnum==y. Using the logical operator AND adds a lot of power to this statement. However, in R we have been doing this using the indices directly and the square bracket notation as (with x = 1 and y = 2):

> data[1,2]
[1] 23.4

Selecting several rows or columns, which in SQL is SELECT * FROM data WHERE rownum>=5 AND rownum<=7, is accomplished in R as:

> data[5:7,]
  Population Height    Sex
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male

To get only a subset of the variables in each row, you can indicate which variables you are interested in selecting in SQL as SELECT height, sex FROM data, and in R we can either slice the column index as:

> data[,2:3]
  Height    Sex
1   23.4 Female
2   32.9 Female
3   29.7 Female
4   38.2   Male
5   32.7   Male
6   28.4 Female
7   27.3   Male
8   27.7   Male
9   30.1 Female

Or we can use the subset() function as:

> subset(data, select=c("Height","Sex"))
  Height    Sex
1   23.4 Female
2   32.9 Female
3   29.7 Female
4   38.2   Male
5   32.7   Male
6   28.4 Female
7   27.3   Male
8   27.7   Male
9   30.1 Female

Oftentimes you will have rather large data sets in R that you will be working with, and it may be easier to grab parts of your data set by using names of variables rather than by using column indices (it is up to you).

You can also get a bit more specific and only look for components in your data set using relational operations. For example, the SQL statements SELECT * FROM data WHERE height>30 and SELECT * FROM data WHERE height>30 AND columnnum==2 are accomplished in R by:

> data[data$Height>30,]
  Population Height    Sex
2          A   32.9 Female
4          A   38.2   Male
5          A   32.7   Male
9          B   30.1 Female
> data[data$Height>30,2]
[1] 32.9 38.2 32.7 30.1

Notice how in the last example here I mixed selecting subsets of observations using the relational operator > with selecting subsets of columns using the numeric index. Also notice how using the 2 in the position after the comma gives only the second column of data.

You can combine conditions in a SELECT-like query such as SELECT * FROM data WHERE height>30 AND sex="Male" by using the logical AND operator & as:

> data[data$Height>30 & data$Sex=="Male",]
  Population Height  Sex
4          A   38.2 Male
5          A   32.7 Male

This complicated statement needs to be dissected to reduce confusion. The part in the square brackets [] consists of the stuff on the left side of the comma (data$Height>30 & data$Sex=="Male") and the stuff on the right side (which happens to be empty in this case). There are some things to remember when doing compound statements like this:

1. The & operator requires that the things on both sides of it are TRUE.

2. The equality operator == must be a double equals sign.

3. I find it easy to take a few passes at these compound statements to make sure I am getting them correct.

In addition to the AND operator in the SELECT statements there is also an OR operator. It is valid to say in SQL SELECT * FROM data WHERE sex=="FEMALE" OR population=="A". This can also be done in R using the OR operator |.

> data[data$Sex=="Female" | data$Population=="A",]
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
9          B   30.1 Female

If the selection of subsets of your data becomes more complicated than this, you can use parentheses to separate out conditions. This makes it easier for you to read, and since you are the one who will be writing this code and coming back later and looking at it, it pays to be as un-convoluted as possible. Here is a whack example from the SQL SELECT * FROM data WHERE (population=="A" AND sex=="Female") OR (population=="B" AND height<30).

> data[(data$Population=="A" & data$Sex=="Female")
+ | (data$Population=="B" & data$Height<30),]
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male

Note: I split the command across two lines at the OR operator. In R when you do this, it gives you the little + sign and you can continue typing as if it were on a single line. I had to do this because the command is longer than the width of this paper...
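The subset() function accepts the same conditions as its second argument, which spares you from repeating the name of the data frame inside the brackets. The sketch below rebuilds the example data under the invented name d and runs one of the queries above both ways; the result names are likewise made up for the illustration.

```r
d <- data.frame(
  Population = c("A","A","A","A","A","B","B","B","B"),
  Height     = c(23.4,32.9,29.7,38.2,32.7,28.4,27.3,27.7,30.1),
  Sex        = c("Female","Female","Female","Male","Male",
                 "Female","Male","Male","Female"))

# The AND query written with square brackets ...
viaBrackets <- d[d$Height > 30 & d$Sex == "Male", ]

# ... and with subset(), where column names are used directly
viaSubset <- subset(d, Height > 30 & Sex == "Male")

viaBrackets
viaSubset

# Conditions and column selection can be combined in one call
either <- subset(d, Sex == "Female" | Population == "A",
                 select = c(Population, Height))
either
```

One difference worth knowing: when the condition evaluates to NA for a row (because of missing data), subset() drops that row, whereas square-bracket indexing returns a row filled with NA values.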

3.3.2 Joins

OK, so now that you have everything you want to know about how to select stuff from within a single data frame with an arbitrary level of complexity, let's move into joins. A join is an operation where you have two or more tables (or data frames) and you create a new one by merging them, provided that they share a variable you can use as a common index. Here are the two example tables that we will be using. The first table is the data table we have been working with thus far.

> data
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male
9          B   30.1 Female

The second table is one that has characteristics of the Populations themselves. It is in the example data sets and is called PopulationAttributes.txt and we can load it into R as:

> popData <- read.table("PopulationAttributes.txt", header=T, sep=",")


> popData
  Population LongName      State    North      East Elevation
1          A Richmond   Virginia 37.53300  -77.4670      45.7
2          B  Seattle Washington 47.60972 -122.3331       0.0

If you look at these two tables, there is the common variable Population. So in essence, I could combine the data from popData and data to create a new data set that has all this information. It is common in databases to have tables split like this. It saves space (imagine having the 5 extra data columns for each row in data; it would be repetitive and for large data sets may max out the memory of your computer). It is also common to find biologists who have programmed software to do some kind of analysis that requires you to put some kinds of data in one file, another kind in a second file, etc. Joins allow you to take these different data frames and join them (catchy name, no?).

To join two tables you will use the function merge() on the data sets. In SQL this would be SELECT * FROM data, popData WHERE data.Population == popData.Population. Fortunately, it is a bit easier to do this in R; here is an example:

> merge( data, popData )
  Population Height    Sex LongName      State    North      East Elevation
1          A   23.4 Female Richmond   Virginia 37.53300  -77.4670      45.7
2          A   32.9 Female Richmond   Virginia 37.53300  -77.4670      45.7
3          A   29.7 Female Richmond   Virginia 37.53300  -77.4670      45.7
4          A   38.2   Male Richmond   Virginia 37.53300  -77.4670      45.7
5          A   32.7   Male Richmond   Virginia 37.53300  -77.4670      45.7
6          B   28.4 Female  Seattle Washington 47.60972 -122.3331       0.0
7          B   27.3   Male  Seattle Washington 47.60972 -122.3331       0.0
8          B   27.7   Male  Seattle Washington 47.60972 -122.3331       0.0
9          B   30.1 Female  Seattle Washington 47.60972 -122.3331       0.0
> class( merge( data, popData ) )
[1] "data.frame"

As you can see, it returns a new data frame with all the data included. I think this has gotten you enough exposure so that you can probably be dangerous. The best way to get comfortable with these methods is to actually use them.
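As a hedged sketch of how merge() can be told which column to join on, the following uses two small hand-made data frames (cut-down stand-ins for data and popData, not the files from the example data sets). The by argument names the shared key, and by.x and by.y cover the case where the key holds the same values under different names; the column name Pop is hypothetical.

```r
# Small stand-ins for the two tables used in the text.
heights <- data.frame(
  Population = c("A", "A", "B"),
  Height     = c(23.4, 38.2, 28.4)
)
sites <- data.frame(
  Population = c("A", "B"),
  LongName   = c("Richmond", "Seattle")
)

# Name the shared key explicitly rather than relying on matching names.
joined <- merge(heights, sites, by = "Population")
print(joined)

# If the key is named differently in one table, use by.x and by.y.
sites2  <- data.frame(Pop = c("A", "B"), LongName = c("Richmond", "Seattle"))
joined2 <- merge(heights, sites2, by.x = "Population", by.y = "Pop")
print(joined2)
```

Naming the key explicitly costs a few extra characters but protects you if the two tables later gain another column with the same name.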


3.4 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• cbind(x) This function binds a column onto the right side of x. This only works with some kinds of data types (e.g., those where appending a column of data makes sense).

• rbind(x) This function binds a row of data onto the end of x. Again, this works only for those data types where the operation makes sense.

• load(x) If x is the name of a .Rdata data file then it will load the contents into memory.

• merge(x,y) This function takes two data frames and merges them on a common variable name. If there is more than one common variable name you can specify which one to use, and if there are no commonly named variables then you are out of luck (unless you have variables that hold the same data but are just named differently).

• rm(x) This function removes x from memory. Gone. Auf wiedersehen. Can't get it back.

• save(x,file=y) This function saves the R object x to a file named y.

• subset(x) This function returns a slice of your data frame where you can specify which rows and variables to keep. You can also do this with creative use of conditional operators and variable names.


3.5 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Create three different variables: a logical one, one that is a numeric type, and a vector of characters. Use these to create a data frame named theData.

2. In the folder for this Chapter there is a text file named GuinneaPigData.csv. Load it into memory and print out a summary.

3. How do you indicate a missing data point in a data file?

4. Add a numeric data column to the existing data frame, theData. Provide a summary of the data.

5. How would you save the data frame, theData, to a file named newData.Rdata?

6. What is the difference between row major indexing and column major?

7. Using index numbers, select the 2nd and 3rd rows of the data set theData.

8. Read in the data file PersonData.csv from the class data set. What kind of data type is the variable Names? How can you change this to a character type and then change the name of the third entry in the data frame, theData, to Thomas?

9. Create a new data set with two variables, one that is Order = -1:4 and the other that is Home=c("Olympia", "Juanita", "Centralia", "Tacoma", "Olympia", "Olympia"). Merge this data frame with the one named theData and assign it the name combinedData.

10. How would you perform a query of the combined data set to select all records that have Order >= 3 or Home == "Olympia"?


Chapter 4

Summary Statistics

In this chapter you will explore some of the methodologies that R has for describing your data. R is an excellent platform for exploring data, looking at relationships among variables, and graphically portraying results.

In this Chapter you will learn the following skills:

• Learn about some common numerical distributions.

• Learn about commonly used statistical distributions.

• Understand parametric summary statistics.

• Explore non-parametric summary statistics.

• Use the table() function as an entry point into contingency table analysis.

• Create single and multiple line figures.

• Create histograms and density plots.

4.1 Distributions

R and its various sub-packages contain more numerical distributions than you will probably ever need to use. Moreover, they provide them in a clear and concise interface that has a consistent format. To my knowledge, all the distributions provide the following four components:

1. A density function that is of the form dNameOfDistribution (e.g., dnorm(), df() & dchisq()).

2. A distribution function that is called as pNameOfDistribution (e.g., pnorm(), pf() & pchisq()).

3. A quantile function named qNameOfDistribution (e.g., qnorm(), qf() & qchisq()).

4. A function that produces random numbers sampled from the distribution that is named rNameOfDistribution (e.g., rnorm(), rf() & rchisq()).
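To make the naming pattern concrete, here is a minimal sketch that calls each of the four function families for the normal and chi-squared distributions. The particular argument values (1.96, 0.975, three degrees of freedom, the seed) are arbitrary choices for illustration.

```r
# The four function families, shown for the normal distribution.
dnorm(0)            # density at x = 0
pnorm(1.96)         # distribution function: P[X <= 1.96]
qnorm(0.975)        # quantile: the x for which P[X <= x] = 0.975
set.seed(42)        # fix the seed so the random draws are repeatable
rnorm(3)            # three random draws

# The same prefixes work for other distributions, e.g. chi-squared.
dchisq(1, df = 3)
pchisq(1, df = 3)
qchisq(0.5, df = 3)
rchisq(3, df = 3)

# The p and q functions undo one another.
all.equal(pnorm(qnorm(0.975)), 0.975)
```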

These are specifically helpful in a number of situations. For example, you may be running a test and calculating a $\chi^2$ statistic on some table of data and want to know if the value of your observed statistic, $\chi^2_{Obs}$, is large given the particular degrees of freedom that you have at your disposal. Now typically, we have memorized, due to the sheer number of times that we have used it, what the critical value for a $\chi^2$ statistic with a single degree of freedom should be (≈ 3.841459, right?). However, what if we have 8 degrees of freedom and $\chi^2_{Obs} = 15.507$? You could go find that old stats book on the shelf and page through the back of it to find the correct Appendix that has the right table (How do you read those tables again?). Or you could use the various functions in R.

In this section, three aspects of using distributions within a statistical context will be introduced. First, you will learn how to determine critical values for the $\chi^2$ distribution as used in formal hypothesis testing using the quantile functions. Then you will see how the distribution function can tell you the probability of a particular estimation of the $\chi^2$ test statistic.

4.1.1 Finding Critical Values

In formal hypothesis testing, there is a specific test statistic that is proposed. Moreover, the estimation of a value for that statistic is compared to a known cutoff set by the degrees of freedom in the model and the Type I error rate that you have chosen (e.g., the α value). For some reason, we biologists have settled on α = 0.05 as having some kind of special meaning. Now, this is probably an oversimplification of things that was used initially as a teaching aid for understanding the meaning of Type I errors. There is nothing intrinsically interesting about α = 0.05 and it is probably more informative for me to know the real probability of your calculated test statistic rather than whether it exceeds some arbitrary cutoff. I mean, is it really that different an interpretation if P = 0.049 versus P = 0.051? That being said, let's jump into understanding how we find critical values for some pre-defined value for α in different distributions.

The distribution most commonly encountered as an undergrad is probably the $\chi^2$ distribution. The distribution itself is shown in Figure 4.1 for three different values of the degrees of freedom. This and other statistical distributions require that you provide the degrees of freedom before they can give you any information.

For any one particular set of the parameters α and df, there is a defined cutoff. The value of the cutoff is defined as the point along the x-axis at which 1 − α of the area under the curve lies to the left of the point and α of the area under the curve lies from that point and beyond. This is a very non-technical definition, but I think you get the point when you consider the α shaded region in Figure 4.2 and the 1 − α region that is unshaded.

To determine the critical value of the $\chi^2$ distribution you use the qchisq() function. If you were to look up the signature of this function (by typing ?qchisq into R), you would see that it accepts the following options:

qchisq(p, df, ncp=0, lower.tail = TRUE, log.p = FALSE)

There are two required parameters for this function, p and df. You can tell by looking at this signature that they are required because they do not have an = sign next to them and a default value given. If a parameter has a variable=value format in a function signature then the value will be assigned to variable if you do not give it a value when you call the function. Default values are very helpful and save a lot of typing on your part.

Figure 4.1: Values for the density function for the $\chi^2$ distribution with 1, 2, and 3 degrees of freedom.

Figure 4.2: A graphical depiction of the critical value of the $\chi^2$ distribution for α = 0.05 and df = 3. The shaded region constitutes a proportion of the area under the curve equal to α.

The parameter p is the 1 − α cutoff you are interested in finding. In the classic case, this would be 1 − 0.05 = 0.95. At first, it seems a little backwards to use 1 − α instead of α, but if we look at the graphical depiction of this distribution in Figure 4.2, we see that the point in question is where we have 95% of the area under the curve to its left and we are interested in the extreme α portion. The next required parameter is df, which corresponds to the degrees of freedom. As shown in Figure 4.1, this parameter controls both the shape and location of the $\chi^2$ values.

There are several optional parameters that you can pass to the qchisq() function and I will briefly mention them here for completeness. If you are interested in a more in-depth discussion of these parameters, look up the qchisq() function and read the documentation. The ncp=0 option specifies a non-centrality parameter allowing you to get the critical values for a non-central $\chi^2$ distribution. The lower.tail=TRUE option indicates that p is the proportion of the distribution in the lower tail (i.e., P[X ≤ x] = p) rather than in the upper tail (i.e., P[X > x] = p). The default value here is what we expect, since passing p = 1 − α puts the α proportion on the right side of the distribution and not on the left side (which, for α = 0.05 and df = 3, would be all the values less than or equal to 0.3518). Finally, the log.p=FALSE option allows you to query using the log of p rather than p directly.

There are several other statistical distributions that you can query in R for particular critical values. Common ones that you will be playing with in the Exercises portion of this chapter include Student's t from qt() and Fisher's F from qf().
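The following sketch looks up the two $\chi^2$ critical values mentioned earlier (one and eight degrees of freedom at α = 0.05) and shows the equivalent upper-tail call. The degrees of freedom used for the t and F examples are arbitrary values chosen for illustration, and splitting α in two for the t quantile assumes a two-tailed test.

```r
alpha <- 0.05

# Chi-squared critical values: p is 1 - alpha, df is the degrees of freedom.
crit1 <- qchisq(1 - alpha, df = 1)
crit8 <- qchisq(1 - alpha, df = 8)
print(c(crit1, crit8))

# Equivalent call that asks for the upper tail directly.
critUpper <- qchisq(alpha, df = 8, lower.tail = FALSE)
all.equal(crit8, critUpper)

# Student's t (two-tailed, so alpha is split) and Fisher's F.
tCrit <- qt(1 - alpha / 2, df = 10)
fCrit <- qf(1 - alpha, df1 = 2, df2 = 10)
print(c(tCrit, fCrit))
```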

Scatter & Line Plots

Creating a simple plot of a line (or points in a sequence) is accomplished using the plot() function. The signature of this function (i.e., the things that you can pass to the function and the things it expects) is:

plot(x, y, ...)

This listing is not very informative! Don't worry, they get more interesting as we go along. The plot() function is kind of a dummy function that allows you to plot lots of different kinds of things, and if things can be plotted, they should know how to plot themselves. Well, that is the theory at least. Let's jump into this graphing stuff by starting off with a more basic approach to creating graphs and building up to what we see in Figure 4.1.

When you begin to create a plot, there are some default characteristics of the plot that you may want to override. For example, the R code plot( rnorm(10) ) produces the graph shown in the leftmost panel of Figure 4.3, consisting of a sequence of 10 random points selected from a normal probability distribution (we will discuss these random functions later in Section 4.2). If you try it, yours will look different; that is why they are called random...

The function rnorm(x) returns x random numbers selected from a normal probability distribution with µ = 0 and σ = 1.0 (you can change these values; check the documentation on this function using the ?rnorm command). When you look at this plot, it is rather plain and does not convey any more information than 10 little circles. It may be of interest to you to be able to change some of the properties of this plot. For example, you may want to modify:

• The shape of the symbols

• The color of the symbols

• Add a line to connect the symbols, and perhaps modify the color, width, and shape of that line.

• Provide more meaningful axis labels.

• Remove the box around the plot (my pet peeve)

To do this, we must understand what a graph consists of, how to access the various components, and how to find more information on the appropriate levels that can be set for these components. This chapter will be very long because it takes a lot of page real estate to show a graph, but I think you'll be happy with the results when you can whip out a nice looking graph of your data. When possible, I will use random numbers to create these graphs, so as you go through and attempt to recreate them, yours will look slightly different than mine.

To customize any of these values, you need to pass additional information to the plot() function. This is what the ... part of the function signature shown above is for. Table 4.1 shows a list of additional commands that can be passed to the plot() function to customize plot appearances.

Here are some examples of how you would use some of these optional parameters, with the resulting graphs shown in Figure 4.3:

> plot(x, y, xlab="X Label", ylab="Y Label", pch=3, col="green", bty="l")
> plot(x, y, xlab="X Label", ylab="", pch=2, col="blue", type="b", bty="n", lwd=2)
> plot(x, y, xlab="X Label", ylab="", main="Title", sub="subtitle", pch=2, col="red",
+      type="l", bty="n", lwd=5)

Figure 4.3: Some example graphs with alternate values for symbols, line types, widths, colors,and titles.

When creating complicated graphs, I find it easiest to build them up incrementally. Start with a plain plot() command to see what the output looks like. Then customize the labels and titles and plot it again to see it. Then continue to add parameters and review the plot.
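As a rough illustration of that incremental approach, the sketch below draws the same ten random points three times, adding options at each step. It sends the output to a temporary PDF file so that it runs without a screen; the file location, seed, and label text are my own assumptions. At the console you would leave out the pdf() and dev.off() lines and watch the graphics window instead.

```r
set.seed(1)                              # repeatable random data
x <- 1:10
y <- rnorm(10)

outFile <- tempfile(fileext = ".pdf")    # hypothetical output location
pdf(outFile)                             # draw to a file, not the screen

plot(x, y)                                    # step 1: plain plot
plot(x, y, xlab = "Index", ylab = "Value")    # step 2: add axis labels
plot(x, y, xlab = "Index", ylab = "Value",    # step 3: symbols, connecting
     pch = 2, col = "blue", type = "b",       #         lines, and no box
     lwd = 2, bty = "n")

dev.off()                                # close the file
file.exists(outFile)
```

Each call to plot() starts a new page in the PDF, so the three steps end up as three pages that can be compared side by side.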


Table 4.1: Some useful additional commands to customize the appearance of a figure. For a complete listing of possible values that can be customized, try the ?par command.

Command  Usage                       Description
bg       bg="red"                    Colors the background of the figure the specified color.
bty      bty="x"                     Sets the style of the box around the graph. Useful values are "o" for a complete box (the default); "l", "7", "c", "u", and "]", which make a box with sides around the plot area resembling the upper case version of these characters; and "n" for no box (my preference).
cex      cex=1.0                     Magnifies the default font size by the corresponding factor.
col      col="blue"                  Colors the lines and symbols the given color.
fg       fg="blue"                   Colors the foreground of the image the given color.
lty      lty=x                       Specifies the line type (0 = none, 1 = solid, 2 = dashed, 3 = dotted, etc.).
lwd      lwd=x                       Specifies the width of the line (1 = default).
main     main="Title for Graph"      Sets a title along the top of the graph.
mfrow    mfrow=c(nr,nc)              Creates a matrix of plots with the given number of rows (nr) and columns (nc; see 4.3.1 for example).
pch      pch=x                       Sets the symbol that is plotted on the figure.
sub      sub="Subtitle on Graph"     Adds a subtitle below the x-axis label at the bottom of the graph.
type     type="x"                    Sets the plot type. Plot types can be "p" for points (the default), "l" for lines, and "b" for both lines and points.
xlab     xlab="label for x-axis"     Sets the label on the x-axis.
ylab     ylab="label for y-axis"     Sets the label on the y-axis.


Overlaying Plots

There are times where it is desirable to produce several plots on a common background (e.g., the different values for df in Figure 4.1). R allows you a lot of leeway to mix up different types of graphs in the same plot (see Figure 11.4 for a rather complex combination of images and plots overlaid on the same area).

To overlay two graphs, you use the par(new=T) command to tell R that the next plotting command should draw on the currently active graphics device without clearing it first. The par() function allows you to adjust a lot of different graphical parameters, and plotting a new image onto an existing one is only one of the things that you can adjust. For a full discussion of other options that par() accepts, type ?par in R. You use it as follows:

plot(x1, y1)
par(new=T)
plot(x2, y2)

This will take the plot for the second set of variables and plot it on the same graphics device as the previous one. When you overlay more than one plot on the same graphing area, you must take into consideration the different scales that the graphs have. By default R will try to maximize the area that is being plotted by changing the default ranges of the x- and y-axes. For example, if I have data such as x1 = [0, 1, 2, 3]; y1 = [10, 11, 12, 13] and plot it, R will automatically scale the axes to have limits of xlim=c(0,3) and ylim=c(10,13), which means that the x-axis will start and end at 0 and 3 and the y-axis will start and end at 10 and 13.

This is what would be expected to happen and works nicely until you try to add another plot. If your other data has values of x2 = [11, 12, 13, 14]; y2 = [23, 22, 21, 20] and you try to overlay the two plots by simply typing:

> x1 <- c(0,1,2,3)
> y1 <- c(10,11,12,13)
> x2 <- c(11,12,13,14)
> y2 <- c(23,22,21,20)
> plot(x1, y1, col="red")
> par(new=T)
> plot(x2, y2, col="blue")

You get the image shown in Figure 4.4.

There are several obvious issues with this image.

1. You cannot read the axis labels. The two images are put right on top of each other and the axes are individually scaled to fit the data in each plot() command.

2. It is difficult to tell the relationship among the data. If you look at the raw data, x1[1] ≠ x2[1], but in the plot it appears that they are equal.

3. The labels on the axes are typed over each other.

Figure 4.4: Plot of two data sets using the par(new=T) command but not taking into consideration the axis limits of the two data sets before plotting.

To overcome these issues you need to first find the appropriate limits for the values in both of the data sets, and for both plot() statements we need to set the xlim and ylim values (see ?plot.default) to those limits. These values tell R what the minimum and maximum values for the x- and y-axes should be. Here is some code that does this.

> x1 <- c(0,1,2,3)
> y1 <- c(10,11,12,13)
> x2 <- c(11,12,13,14)
> y2 <- c(23,22,21,20)
> yLimit <- range( c(y1,y2) )
> yLimit
[1] 10 23
> xLimit <- range( c(x1,x2) )
> xLimit
[1]  0 14

Here I combined the y values for both data sets and used the range() function to tell me what the range of these values is. Then I did the same thing for the x values in both data sets. Now, if I make the plot, I can make it for each pair of x & y variables, scaling the axes so that both data sets will be displayed on the same Figure.

> plot(x1, y1, xlab="X", ylab="Y", bty="n", xlim=xLimit, ylim=yLimit, col="red")
> par(new=T)
> plot(x2, y2, xlab="X", ylab="Y", bty="n", xlim=xLimit, ylim=yLimit, col="blue")

Notice how the optional arguments xlim and ylim make sure the axes are scaled correctly (Figure 4.5). I also use bty="n" because I just hate the box that it puts around the plot area by default, and this option does not draw any box at all.

As long as you add a par(new=T) between each successive plot() command, you can add as many plots to the same figure as you would like.
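An alternative that is not used in the text is to draw the first data set with plot() and add the second with points(), which draws onto the axes that already exist so the labels are written only once. The sketch below is a minimal version under that assumption; it writes to a temporary PDF file (a hypothetical location) so that it runs without a screen.

```r
x1 <- c(0, 1, 2, 3);      y1 <- c(10, 11, 12, 13)
x2 <- c(11, 12, 13, 14);  y2 <- c(23, 22, 21, 20)

# Axis limits that cover both data sets.
xLimit <- range(c(x1, x2))
yLimit <- range(c(y1, y2))

pdf(tempfile(fileext = ".pdf"))   # hypothetical file device for the sketch
plot(x1, y1, xlab = "X", ylab = "Y", bty = "n",
     xlim = xLimit, ylim = yLimit, col = "red")
points(x2, y2, col = "blue")      # added to the axes that plot() set up
dev.off()
```

The limits still have to be worked out from both data sets before the first plot() call, because points() never rescales the axes.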


Figure 4.5: Plot of two variables on the same axis after correcting for the range of each data set.

Saving Images To Disk

While it is rather cool to be able to create rather handsome graphics in R, it is entirely useless if you do not know how to save them for later use. You could take a screenshot of the image and then crop it down a bit, but that is not quite the easiest method to use here. Almost all the images in this book were created in R and I was able to save them into a format that made it easy to import them into this document.

R considers the little popup window that shows your graph as a graphics device. Depending upon which platform you are using (e.g., Linux, OSX, Windows), the kinds of output you may be able to produce may change. At present the following types are available:

Device      What receives these graphing commands
bmp         A Windows bitmap device
cairo_pdf   A PDF device based upon the Cairo drawing libraries
jpeg        A JPEG bitmap device
pdf         A PDF file
pictex      A LaTeX graphics command file
png         A PNG bitmap device
postscript  A postscript file
quartz      An OSX graphics window
tiff        A TIFF bitmap device
X11         A graphics window on a system running X-Windows (Unix, some OSX)

Table 4.2: Graphics devices for output of figures


When you type the command plot() a graphics window pops up showing you the image of the figure. What is happening here is that R is looking for the default graphics device, and if you have not specified one, then the default value of "show it to the user as a window" is used.

Creating The Plot And Saving To File: This is the method that I used for all the figures in this text. I first created the figure to look the way that I wanted and then I had R copy the figure to a file. You should be aware that when you copy the image, it will only copy the ACTIVE graphics device. If you have more than one graphics window open, only one of them will say ACTIVE in the window title. Be careful of this or you could be copying the wrong figure.

Once you have the graphic the way you like, you can use the dev.copy() command to copy the current graphics device to a file. For this book, I have been saving all the images as JPEG files, so I pass the function the device=jpeg option and then specify the name of the file. If you want to save yourself some heartache down the road, use meaningful names for the graphics you create. You can quickly accumulate a lot of different plots that you may want to go through at some time in the future, and it sure helps to have them named nicely.

> hist(rpois(1000,2), xlab="Counts", ylab="Frequency", main="", col=topo.colors(8))
> dev.copy(device=jpeg, file="ColoredHistogramOfPoissonDistribution.jpeg")
jpeg
   3
> dev.off()
X11cairo
       2

Once the dev.copy() function is finished, you must call the dev.off() function to tell R that you are finished copying things to that particular file and you no longer want to keep it open and ready for subsequent graphing. The output after the dev.off() command shows which graphics device is now active and what kind of device it is (in general, you can ignore this). The image produced from this plot is shown in Figure 4.6.

I also passed the plot command the optional col=topo.colors(8). The function topo.colors(x) returns x evenly spaced colors from a palette that is used for plotting topo maps. There are other default palettes in R you can use (see ?topo.colors for a list) in coloring parts of your figures. I knew that the hist() function would return 8 bins of data from the rpois(1000,2) distribution (I plotted it first and counted), so I added 8 evenly spaced colors to the plot just to make it look a bit more cheesy.
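Instead of plotting first and counting the bars by eye, the number of bins can be read from the object that hist() returns. The sketch below is one way to do that, assuming a fixed seed and a temporary PDF file (both my own choices) so that it runs the same way each time.

```r
set.seed(2)
counts <- rpois(1000, 2)

# plot = FALSE returns the histogram object without drawing anything.
h     <- hist(counts, plot = FALSE)
nBins <- length(h$counts)

# One colour per bin, however many bins there turn out to be.
binColors <- topo.colors(nBins)

pdf(tempfile(fileext = ".pdf"))   # hypothetical file device for the sketch
hist(counts, xlab = "Counts", ylab = "Frequency", main = "", col = binColors)
dev.off()
```

Because the draws are random, the number of bins can differ from run to run, which is why the colour count is taken from the data rather than typed in.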

Plotting Directly To A File: Plotting to a graph window and copying it to a file is not necessarily the only way you can get your graphics saved. You could just write them directly to a file using one of the graphics devices listed in Table 4.2 without looking at it in a window. I find this less appealing since I would like to see what I am plotting before saving it, but if you are chugging through lots of data and creating hundreds of images, perhaps you would be better served to make the plots directly and view them later. At any rate, here is how it is done.

jpeg()
plot(rnorm(1000), xlab="index", ylab="value", bty="n")
dev.off()

With these commands, R opens a jpeg() graphics device. This device is generally a file in the local directory that is named RplotXXX.jpeg (where the XXX values are incremental numbers such as 001, 002, ...). Then when you call the plot() function it sends the plotting commands to the image in the file (strictly speaking, it keeps them in a buffer and not in the file directly). You can add as many plotting commands as you like and it will continue to send them to the file you specified. When you are done, you finalize the image by calling dev.off() to turn off the graphics device. To change the default incremental numbering of the files, you can pass a file name to the jpeg() function (or any of the other ones) as we did in the previous section using dev.copy().

Figure 4.6: Image of colored Poisson distribution that was copied from the graphics device to a jpeg file.

4.1.2 What Probability?

The outcome of a statistical analysis is the estimation of a particular test statistic. For example, when you calculate a $\chi^2$ statistic, you need to look up the probability that a value as large or larger than the observed one is expected to occur. In 4.1.1 we determined how to calculate the cutoff value from a particular distribution given a specified Type I error rate (the α value). Here we are not interested in asking whether our calculated value exceeds some particular cutoff; rather, we are interested in the probability of observing a value as large or larger than the one we see.

In keeping with the current examples from the $\chi^2$ statistic, we can determine the probability associated with a particular estimation of $\chi^2_{Calc}$ by using the distribution function pchisq(). The arguments to pchisq() are almost identical to those for the qchisq() function discussed in 4.1.1, with the exception that we do not pass it 1 − α as the first parameter; rather we pass it the estimated $\chi^2_{Calc}$ value and it will return the answer in terms of P[X ≤ x]. For example:

> chiCritAt0.05 <- qchisq(0.95,1)
> pchisq(chiCritAt0.05, 1)
[1] 0.95
> pchisq(7.23, 3)
[1] 0.9350828

The functions qchisq() and pchisq() are inverses of each other: qchisq() takes a cumulative probability P[X ≤ x] and tells us the critical value x, whereas pchisq() takes a value for $\chi^2$ and tells us the cumulative area under the curve up to and including that point.
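Since pchisq() reports the lower-tail area, the probability of a value as large or larger than the observed one is its complement. The sketch below works this out for the earlier example of an observed statistic of 15.507 with 8 degrees of freedom; the variable names are my own.

```r
chiObs  <- 15.507   # observed statistic from the example in Section 4.1
degFree <- 8

# pchisq() gives P[X <= x]; the probability of a value as large or larger
# is the complement.
pLower <- pchisq(chiObs, degFree)
pUpper <- 1 - pLower

# The same upper-tail probability, requested directly.
pUpper2 <- pchisq(chiObs, degFree, lower.tail = FALSE)

print(c(pLower, pUpper, pUpper2))
```

The upper-tail probability comes out very close to 0.05, which is what we should expect because 15.507 is the α = 0.05 critical value for 8 degrees of freedom.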

4.2 Random Number Generation

There are often times when you need to generate some random numbers (playing poker, picking lottery numbers, etc.). Random numbers can be drawn from any of the distributions that are in R using the rdistribution function. For example, to draw random numbers from a normal distribution (N(µ, σ)) you would call the rnorm(x, µ, σ) function, where x is the number of values to draw. The parameters µ and σ signify the mean and standard deviation of the distribution from which you are drawing. For an example of how these influence the outcome, check out Figure 4.7.

There are a large number of random number distributions that you can run across. Below are some commonly encountered ones:

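Normal The normal distribution has a density function of $P(x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$.

Exponential The exponential distribution has a continuous density function of $P(x|\lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$; its distribution function is $1 - e^{-\lambda x}$.

Poisson The Poisson distribution is a discrete distribution whose density function is $P(k|\lambda) = \frac{e^{-\lambda}\lambda^{k}}{k!}$.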

Later in the Exercises you will get to use some of these distributions.
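As a small sketch, the code below draws a few values from each of the three distributions and then compares R's built-in density functions with the standard textbook forms of the normal, exponential, and Poisson densities written out by hand. The seed and the parameter values are arbitrary choices for illustration.

```r
set.seed(123)                          # make the draws repeatable

normDraws <- rnorm(5, mean = 0, sd = 1)
expDraws  <- rexp(5, rate = 2)
poisDraws <- rpois(5, lambda = 2)
print(normDraws); print(expDraws); print(poisDraws)

# Compare the built-in density functions with the formulas written out.
x <- 1.5; mu <- 0; sigma <- 2; lambda <- 2; k <- 3
all.equal(dnorm(x, mu, sigma),
          exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi)))
all.equal(dexp(x, rate = lambda), lambda * exp(-lambda * x))
all.equal(dpois(k, lambda), exp(-lambda) * lambda^k / factorial(k))
```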

Figure 4.7: Examples of the densities of two normal distributions; the red one is drawn from a random normal distribution with default values of µ = 0 and σ = 1 and the blue one has µ = σ = 5.

Histograms

A histogram is a graphical display of data that has been tallied into bins (e.g., specific buckets). How you define the bucket locations and sizes is up to you. You can specify that there should be a specific number of buckets and R will make them equal sized, or you can define ranges yourself. The function signature for the hist() function can be seen by typing ?hist in R:

hist(x, breaks = "Sturges",
     freq = NULL, probability = !freq,
     include.lowest = TRUE, right = TRUE,
     density = NULL, angle = 45, col = NULL, border = NULL,
     main = paste("Histogram of", xname),
     xlim = range(breaks), ylim = NULL,
     xlab = xname, ylab,
     axes = TRUE, plot = TRUE, labels = FALSE,
     nclass = NULL, ...)

There are several things we should notice about this function signature. First, this is the first time that we've looked into a particular function and seen all the options. You can see that several of the parameters are given what we call default values (e.g., the =VALUE portions). That way if you do not provide a particular value for a parameter such as main, R will fill it in for you.


The first thing that you typically want to change in a graphic is the default values forthe axis labels and the title of the graph. It is not commonly accepted practice to providetitles on graphs for most publication-quality graphics, but some times it is helpful whenyou are putting together a talk or just analyzing the data and making graphics for yourown interpretation. To change the default values of the axis labels and set an empty titleyou would do the following (shown in Figure 4.8):

> hist(rnorm(100), xlab="My Defined Bin Categories", ylab="Frequency", main="")

Figure 4.8: Histogram with labels and main title changed.

Again, I am using the function rnorm() to generate the data from a random normal distribution here. It is perfectly OK to give empty values to things like titles and such.

Density Plots

A density plot is one where the probability density is calculated and turned into a line across the domain rather than a histogram. Here I will combine the histogram and density plots to show how to overlay two graphs on the same values.

> data <- rpois(lambda=5, n=1000)
> den <- density(data)
> den

Call:
        density.default(x = data)

Data: y (1000 obs.);  Bandwidth 'bw' = 0.5061

       x                y
 Min.   :-1.518   Min.   :3.567e-05
 1st Qu.: 2.491   1st Qu.:8.145e-03
 Median : 6.500   Median :3.973e-02
 Mean   : 6.500   Mean   :6.229e-02
 3rd Qu.:10.509   3rd Qu.:1.219e-01
 Max.   :14.518   Max.   :1.689e-01
> yrange <- range(den$y)
> xrange <- range(den$x)
> hist(data, ylim=yrange, xlim=xrange, xlab="Value of Random Poisson",
+      ylab="Frequency", main="", probability=T, bty="n")
> par(new=T)
> plot(den, col="red", lwd=2, xlab="", ylab="", main="", bty="n")

Figure 4.9: Histogram of 1000 random numbers drawn from a Poisson distribution with the λ parameter set to 5. The red line indicates the density of the values.

There are some things to point out with this plot.

1. I saved the values of data as a variable because I needed to plot the same set of random variables as a histogram and as a density plot. Had I not saved them, I would be using a different collection of random numbers for each plot and they wouldn't match.

2. I used the function density() to calculate the probability density function for the values of data. The object returned by density() has two components, an x variable and a y variable. The probability density is calculated as a probability rather than as a frequency count (as the default histogram is), which is why the histogram was drawn with probability=T.
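To make the second point a little more tangible, here is a minimal sketch that pokes at the x and y components and checks that the curve behaves like a probability density. The seed is arbitrary, and using lines() to add the curve is simply one alternative to par(new=T), not the approach used for Figure 4.9.

```r
set.seed(1)                        # arbitrary seed
data <- rpois(n = 1000, lambda = 5)
den <- density(data)

names(den)[1:2]                    # the x and y components
length(den$x)                      # number of points where the density is evaluated

# The area under a probability density should be close to 1
area <- sum(den$y) * diff(den$x[1:2])
area

# Alternative overlay: draw the histogram, then add the density curve to it
pdf(NULL)                          # throwaway device; omit when working interactively
hist(data, probability = TRUE, main = "", xlab = "Value of Random Poisson")
lines(den, col = "red", lwd = 2)
invisible(dev.off())
```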


4.3 Descriptive Statistics

Descriptive statistics are valuable tools in understanding particular patterns in your data. For the purposes of this section, we will assume that the experiments that are producing your data yield one of two different data types. First, observations from your data could be considered random variables: measurements that produce a real number. Examples of random variables may be body size, dissolved oxygen, available light, etc. A collection of random variables will be denoted as X with elements x_i; i = 1 ... N (i.e., indexing across all N individual observations). The other kind of data we will be examining here are categorical data. Your observations are grouped into distinct categories and consist of relative counts of each category. Examples of this include stage-dependent demographic tallies, gender of your study organisms, some types of genetic data, disease prevalence, etc. Categorical data will be denoted as Y, consisting of K categories, and the number of counts observed in each category will be referred to as y_i; i = 1 ... K.

There are two general properties of random variables that we will spend a little time discussing because they form the basis of how we examine our data. First, the mean of a random variable, usually denoted by the symbol µ, is a measure of the central tendency of your variable (a center of gravity, so to speak). We are all familiar with the concept of the mean, but in a general sense the mean is just one of several moments of a distribution, so we turn to this particular moment first and then discuss some of the "higher moments."

4.3.1 Moments

There are several properties of random variables that we may be interested in estimating. Notice that here I used the term estimate rather than compute; this is on purpose. We will be making estimates of real parameters of the data and we do so because in most cases we do not have all the data at our disposal. Rather, we have created a sample of our data from which we make inferences. To get all the data, we would have to sample EVERY single instance out there and in most cases this is not possible.

There are two common properties that you will probably recognize immediately (I hope) and use all the time. These are the mean and variance of the data and are estimated in R using the functions mean() and var(). Figure 4.10 shows what is being measured by these estimators. This figure was created using the density() function from rnorm(1000000).

The mean, shown by the dashed line and the symbol µ, is located at the center of gravity of the data. In R, you can calculate the mean of the data by using the function mean(). The image also shows the standard deviation (which is the square root of the variance, σ = √σ²) as indicated by the dotted line. R has a function for both the variance, var(), and the standard deviation, sd().
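A quick way to convince yourself of how these three estimators relate is to compute them side by side. This is only a sketch on hypothetical measurements (the mean of 12 and standard deviation of 3 are made up for illustration).

```r
set.seed(10)                          # arbitrary seed
x <- rnorm(1000, mean = 12, sd = 3)   # hypothetical measurements

m <- mean(x)
v <- var(x)
s <- sd(x)

# var() estimates the sample variance, so it divides by N - 1
v_by_hand <- sum((x - m)^2) / (length(x) - 1)

c(mean = m, variance = v, sd = s)
all.equal(s, sqrt(v))                 # sd is the square root of the variance
all.equal(v, v_by_hand)
```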

Figure 4.10: Example locations for first two moments of a Normal (N(0, 1)) distribution.

There are two more measures of distributions that we should discuss while we are here.² These are the skew and kurtosis of the distribution. In R these functions are not loaded into memory by default and we must load the moments library to gain access to them. To load this library type:

> library(moments)

If R gives you an error, this means that the moments library is not installed by default. In this case, see Appendix B for instructions on how to add libraries to your installation of R.

² Actually all four of these measures are tied to the first four moments of the distribution. The mean is the first moment, µ = E[X], and the higher central moments, µ_k; k = 2 ... 4, can be calculated by µ_k = E[(X − µ)^k]. The variance is µ_2, while skew and kurtosis are µ_3 and µ_4 scaled by σ^3 and σ^4.

The skew of a distribution is a measure of how "pushed-over" the main lump of the distribution is (again not a very statistical definition here). Distributions can have either a positive or negative skew; compare the images in Figure 4.11.

A distribution is said to have a negative skew if the direction of the longer tail is to the left. In these cases, typically, the mean < median < mode. Conversely, a distribution has a positive skew if the tail is on the right and the mean > median > mode. Distributions where these measures are equal are said to not have any skew. Skew is estimated in R using the function skewness().
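If the moments library is not at hand, the sign of the skew can still be checked in base R. The helper below, skew_by_hand(), is a hypothetical function of my own (the third central moment scaled by the variance to the 3/2 power), offered as a sketch rather than as a replacement for skewness().

```r
# Hypothetical helper; not part of base R or of the moments library
skew_by_hand <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3/2)
}

set.seed(3)                        # arbitrary seed
right_tail <- rexp(10000)          # long tail to the right
left_tail  <- -right_tail          # mirror image, long tail to the left

skew_by_hand(right_tail)           # positive skew
skew_by_hand(left_tail)            # negative skew

# The mean is dragged toward the long tail, the median much less so
mean(right_tail) > median(right_tail)
mean(left_tail)  < median(left_tail)
```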

Figure 4.11: Negative (left) and positive (right) distributions. In both of these examples the dotted line connects the mode of the distribution (the top peak) to the mean (on the x axis). The direction of this lean determines if the distribution has a negative (left) or positive (right) skew.

The kurtosis of a distribution is a measure of the "peakedness" of a distribution. This term comes from the Greek word kurtos that means 'bulging.' A simple example of how kurtosis looks is found in Figure 4.12 with three different distributions (the normal, logistic, and uniform), each with a different level of kurtosis.

In general, the formula for kurtosis is:

$$K = \frac{\mu_4}{\sigma^4} - 3$$

The correction factor (the - 3 part of the equation) is a normalizing constant that allows the kurtosis of a normal distribution to be equal to zero. Below are the raw data and the kurtosis estimates used in producing Figure 4.12.

> normData <- rnorm(100000)
> logisticData <- rlogis(100000)
> unifData <- runif(100000)
> kurtosis(normData) - 3
[1] -0.02320046
> kurtosis(logisticData) - 3
[1] 1.219505
> kurtosis(unifData) - 3
[1] -1.197009

The discrepancy here in the estimates, showing the normal distribution not quite equal to zero, is because the data were created by drawing random numbers rather than specifying the distribution directly. One benefit of the - 3 correction factor is that it allows you to quickly tell the different types of kurtosis by looking at the value of the estimate. In general, the following types of kurtosis are recognized:

Platykurtic: Curves that have negative excess kurtosis (i.e., kurtosis() − 3 < 0).

Mesokurtic: Curves that do not have excess kurtosis (i.e., kurtosis() − 3 = 0).

Leptokurtic: Curves that have positive excess kurtosis (i.e., kurtosis() − 3 > 0).

Figure 4.12: Three distributions (uniform, normal, and logistic) showing different levels of kurtosis.
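The three categories can be sorted out mechanically once the excess kurtosis is in hand. The sketch below uses two hypothetical helpers of my own, excess_kurtosis() and classify(); because an estimate from random draws never lands exactly on zero, classify() uses an arbitrary tolerance of 0.1 that you would want to choose more carefully in practice.

```r
# Hypothetical helper; the - 3 correction is already built in
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Hypothetical helper; tol is an arbitrary cutoff for "close enough to zero"
classify <- function(k, tol = 0.1) {
  if (k < -tol) "platykurtic" else if (k > tol) "leptokurtic" else "mesokurtic"
}

set.seed(4)                                   # arbitrary seed
k_unif  <- excess_kurtosis(runif(100000))     # theoretical value is -1.2
k_norm  <- excess_kurtosis(rnorm(100000))     # theoretical value is 0
k_logis <- excess_kurtosis(rlogis(100000))    # theoretical value is 1.2

sapply(c(uniform = k_unif, normal = k_norm, logistic = k_logis), classify)
```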

The last summary statistic we will cover here is the range(), which returns a two-item vector containing the minimum and maximum values. In fact, the range() function calls the min() and max() directly. There is little to discuss about this particular set of functions...

Creating a matrix of Plots

It is often desirable to create more than one plot on a graphic but not overlaid on top of each other as was explained in Section 4.1.1. To do this, we need to adjust one of the graphics properties using the function par(). The property we need to change is mfrow=c(nr,nc). This will create a matrix of plots that has nr rows and nc columns.

An example of creating a matrix of plots is given in the code below and depicted in Figure 4.13.

> par(mfrow=c(2,2))
> hist(rnorm(100000))
> hist(rpois(100000,1))
> hist(rexp(100000))
> hist(rlogis(100000))

Figure 4.13: Matrix of four plots created from random numbers sampled from the normal, Poisson, exponential, and the logistic distributions.

Subsequent calls to plotting functions will "reuse" this graphic figure and replot the graphs in the nr x nc matrix. This graphic window will have the nr x nc matrix of plots until it is either closed or you change the mfrow property to something else.
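Because the mfrow setting sticks around, it can be handy to remember the old value and put it back when you are done. The following sketch relies on the fact that par() returns the previous settings when you change them; the pdf(NULL) line is only there so the sketch can run without opening a window and is an assumption about how you are running it.

```r
pdf(NULL)                       # throwaway device; omit when working interactively

old <- par(mfrow = c(2, 2))     # par() hands back the settings it replaced
hist(rnorm(1000))
hist(rpois(1000, 1))
hist(rexp(1000))
hist(rlogis(1000))
grid_in_use <- par("mfrow")     # 2 rows by 2 columns while the matrix is active

par(old)                        # restore one plot per figure
grid_after <- par("mfrow")
invisible(dev.off())
```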

4.3.2 Non-Parametric Parameters

Non-parametric statistics are generally concerned with the analysis of data that does not make assumptions about the underlying statistical distributions. There are several commonly known non-parametric statistics such as the Binomial Test, Goodness of Fit, the Mann-Whitney Test, and the Kruskal-Wallis test. In this section, we will explore some of the methods that R can use to describe data without assuming an underlying distribution.

The first summary statistic outlined here will be the quantile. While you have probably not heard of this particular descriptive statistic, you most likely will have run across terms such as median, quartile, or percentile. All of these are particular kinds of quantiles, which will be obvious when we consider the formal definition of a quantile.

Quantile: The pth quantile is the value x_p such that, for the data X, the probability P(X < x_p) ≤ p and the probability P(X > x_p) ≤ 1 − p.

While this may be statsy, it generally says that the 50th quantile is the value x_50 in the distribution where 50% of the data is less than x_50 and 50% is greater than x_50. Thus far, you have probably called this the median (and R has a median() function if you like to call it that). More generally though, we can consider the 95th quantile analogous to what we were discussing in Section 4.1.1 when we were trying to figure out critical regions of the χ² distribution. The main distinction here is that in Section 4.1.1 we implicitly used the known distributional form of the χ² function to find the critical value, whereas in non-parametric approaches we typically apply the approach of putting everything into a vector, sorting it, and counting to where the quantile is located in the list. As a result, the 50th quantile (or median) can be considered a measure of central tendency of the sorted data.
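The "sort it and count" idea is short enough to write out. This is a minimal sketch on made-up data; I chose 101 values so that the counting never lands between two observations, and I ask quantile() for type = 1, which is the variant that uses the same counting rule (the default type interpolates between neighbors instead).

```r
set.seed(7)                         # arbitrary seed
x <- rnorm(101)                     # hypothetical data
p <- c(0.25, 0.50, 0.95)

sorted <- sort(x)
by_counting <- sorted[ceiling(p * length(x))]   # count up the sorted list

by_quantile <- quantile(x, probs = p, type = 1, names = FALSE)

all.equal(by_counting, by_quantile)
all.equal(by_counting[2], median(x))            # the middle value is the median
```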

Quantiles can also be used to look at the dispersion of data. In parametric statistics we discussed parameters such as the variance and standard deviation that define the dispersion of values around the mean. The notion of quantiles can be used in a similar way. The values of x that give the lower and upper quartiles (i.e., the 25th and 75th quantiles) provide a range of the data X where the inner 50% of the values lie. These are often called the inner quartiles of the data. To illustrate the use of the quantile function, consider the data in Figure 4.14 consisting of 1000 numbers drawn from a Poisson random distribution with parameter λ = 5.

The quantile() function in R by default provides the 0th quantile (i.e., the minimum), the 25th quantile, the 50th quantile (the median), the 75th quantile, and the 100th quantile (i.e., the maximum). For the data that produced the histogram in Figure 4.14, the quantiles are:

> x <- rpois(1000,5)
> quantile(x)
  0%  25%  50%  75% 100%
   0    3    5    6   12

showing that the median is 5 and the inner quartiles range from 3 to 6.
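You are not stuck with the five default cut points. The sketch below (arbitrary seed, so your numbers will differ from those above) asks for specific quantiles through the probs parameter and ties them back to median() and IQR().

```r
set.seed(5)                            # arbitrary seed
x <- rpois(1000, 5)

q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

all.equal(median(x), q[["50%"]])             # the median is the 50% quantile
all.equal(IQR(x), q[["75%"]] - q[["25%"]])   # width of the inner 50% of the data

quantile(x, probs = c(0.025, 0.975))   # bounds holding the middle 95% of the data
```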

4.4 Relationships Between Pairs of Variables

There are often times when we are interested in knowing about the simultaneous changes in two or more variables. Individually, we can estimate the mean, variance, skew, kurtosis, and various ranges, but this does not tell us about how the variables interact together. For this we need to look at measures that explain the relationship between variables.


Figure 4.14: Distribution of random numbers drawn from rpois(1000,5).

4.4.1 Covariance & Correlation

The covariance of two variables is defined as:

$$c_{ij} = E[(X - \mu_X)(Y - \mu_Y)]$$

and measures the degree to which one variable X changes as another Y changes. Covariance estimates may be positive or negative as long as the two variables are not the same, in which case it is a variance and there is no such thing as a negative variance. Two variables that have a covariance equal to zero are said to be uncorrelated (although if you don't know what a correlation is this moniker is kinda sucky).

In R the covariance between two vectors of values is estimated by the function cov(). Needless to say, the length of the two variables must be the same or R will rightly complain.

> X <- c(1,34,5,23,6,43,56,28,33,7)
> Y <- runif(10,1,100)
> Y
 [1] 90.112843 47.236585 17.148708  3.861546 54.871332 57.234582  8.072745
 [8]  6.000811 84.546069 17.960688
> plot(X,Y)
> cov(X,Y)
[1] 2231.952

Figure 4.15: Scatter plot of some semi-random points.

So here I just pounded on my numeric keypad and made up the numbers for X (not quite random but pretty good) and then had R make some numbers for Y by drawing from a uniform distribution, runif(), selecting 10 values in the range 1 → 100. You can see that the values that I used produced a smattering of points (Figure 4.15).
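To see what cov() is estimating, the expectation in the definition can be written out as a sum. This is a minimal sketch; the seed is arbitrary, so the Y values (and the covariance) will not match the ones printed above, and the N − 1 denominator is the sample estimate that cov() uses.

```r
X <- c(1, 34, 5, 23, 6, 43, 56, 28, 33, 7)
set.seed(15)                       # arbitrary seed; Y differs from the text
Y <- runif(10, 1, 100)

# Sample estimate of E[(X - mu_X)(Y - mu_Y)]
cov_by_hand <- sum((X - mean(X)) * (Y - mean(Y))) / (length(X) - 1)

all.equal(cov_by_hand, cov(X, Y))
all.equal(cov(X, X), var(X))       # a variable's covariance with itself is its variance
```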

4.4.2 Tests For Correlation

There are parametric and non-parametric methods for looking at the relationship among pairs of variables. In general, all correlations between two random variables (X, Y) should have the following characteristics:

• The value of a correlation is strictly bound on the interval [−1, 1].

• If larger values of X tend to be associated with larger values of Y then the correlation should approach +1 as the association becomes stronger. We call this a positive correlation.

• If smaller values of X tend to be associated with larger values of Y then the correlation should approach −1 as the association becomes stronger. We call this a negative correlation.

• If there is no general relation between the variables X and Y then the correlation statistic should approach 0. We call this a relationship where the variables are uncorrelated.

The most commonly used measure of correlation is Pearson's product moment correlation, r, that is calculated as:

$$r = \frac{\sum_{i=1}^{N}(X_i - \bar{x})(Y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(X_i - \bar{x})^2 \sum_{i=1}^{N}(Y_i - \bar{y})^2}} \tag{4.1}$$

where x̄ and ȳ are the means of the N sampled values in X and Y.
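Eqn. 4.1 translates almost symbol for symbol into R. The sketch below borrows the X and Y vectors that this section feeds to cor.test() and compares the hand calculation with the built-in cor() function; the intermediate names dx and dy are my own.

```r
X <- 1:20
Y <- c(-17, 7, -12, 12, -4, 11, 10, -2, 35, 31,
       34, 49, 27, 33, 45, 32, 36, 38, 58, 44)

dx <- X - mean(X)                  # deviations from the mean of X
dy <- Y - mean(Y)                  # deviations from the mean of Y

r_by_hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_by_hand
all.equal(r_by_hand, cor(X, Y))    # cor() uses the Pearson approach by default
```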

Figure 4.16: Example plot of two variables used to test correlations.


In R the test for correlation is performed with the cor.test() function. To demonstrate, we will use the following data shown in Figure 4.16:

> X <- 1:20
> Y <- c(-17, 7, -12, 12, -4, 11, 10, -2, 35, 31, 34, 49, 27, 33, 45, 32, 36, 38, 58, 44)
> cor.test(X,Y)

        Pearson's product-moment correlation

data:  X and Y
t = 7.3194, df = 18, p-value = 8.489e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6848344 0.9456427
sample estimates:
      cor
0.8651642

The correlation between these two variables is r = 0.865, which is both large and positive as expected by looking at the graph. By default when you use cor.test(), it will use the Pearson product moment approach. There are two additional approaches for estimating correlation, developed by Spearman and Kendall, but these two are considered non-parametric methods based upon ranks rather than on the calculation shown in Eqn. 4.1 and will be left until 5.2.1 when we can fully discuss how they work. The output also includes a significance test and a display of the 95% confidence interval, which are very useful.
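The printed output is convenient for reading, but the individual pieces can also be pulled out of the object that cor.test() returns. This is a minimal sketch; the variable names are my own, and the Spearman call is included only to show where the method is selected, not to explain how it works.

```r
X <- 1:20
Y <- c(-17, 7, -12, 12, -4, 11, 10, -2, 35, 31,
       34, 49, 27, 33, 45, 32, 36, 38, 58, 44)

res <- cor.test(X, Y)              # Pearson by default

r_hat <- unname(res$estimate)      # the correlation itself
p_val <- res$p.value               # P-value for the test that it equals zero
ci    <- res$conf.int              # 95% confidence interval by default

c(r = r_hat, lower = ci[1], upper = ci[2])

# Rank-based alternative, chosen through the method parameter
rho <- cor.test(X, Y, method = "spearman", exact = FALSE)
unname(rho$estimate)
```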


4.5 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• dchisq(x,df) Returns the density of the χ² distribution with df degrees of freedom at x.

• df(x,df1,df2) Returns the density of the F distribution with df1 and df2 degrees of freedom at x.

• dnorm(x) Returns the density of a normal distribution at x.

• mean(x) Calculates the mean of the values in x.

• pchisq(x,df) Returns the cumulative probability, P(X ≤ x), of the χ² distribution with df degrees of freedom.

• pf(x,df1,df2) Returns the cumulative probability, P(X ≤ x), of the F distribution with df1 and df2 degrees of freedom.

• plot(x) This is the main wrapper function that creates a graphical display of the variable(s) that you pass to it. Depending upon the variables passed, it will create different types of plots.

• pnorm(x) Returns the cumulative probability, P(X ≤ x), of a normal distribution at x.

• qchisq(x,df) Returns the quantile of the χ² distribution with df degrees of freedom for the probability x.

• qf(x,df1,df2) Returns the quantile of the F distribution with df1 and df2 degrees of freedom for the probability x.

• qnorm(x) Returns the quantile of a normal distribution for the probability x.

• rchisq(x,df) Returns x random numbers from the χ² distribution with df degrees of freedom.

• rf(x,df1,df2) Returns x random numbers from the F distribution with df1 and df2 degrees of freedom.

• rnorm(x) Returns x random numbers from the normal distribution.

• sd(x) Returns the sample standard deviation of data in x.

• table(f) This function takes the list of levels in the factor f and makes a table from it.

• var(x) Estimates the sample variance, s², from the values in x.
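The d, p, q, and r prefixes follow the same pattern for every distribution, which the following sketch tries to show in a few lines. The particular probabilities, degrees of freedom, and seed are arbitrary choices of mine.

```r
# d*: density, p*: cumulative probability, q*: quantile, r*: random draws
dnorm(0)                           # height of the standard normal curve at 0
pnorm(1.96)                        # P(X <= 1.96), roughly 0.975

q <- qnorm(0.975)                  # value with 97.5% of the curve to its left
all.equal(pnorm(q), 0.975)         # the p and q functions undo one another

crit <- qchisq(0.95, df = 1)       # chi-squared critical value for alpha = 0.05
all.equal(pchisq(crit, df = 1), 0.95)

set.seed(2)                        # arbitrary seed
draws <- rchisq(10000, df = 3)
mean(draws)                        # should sit near the degrees of freedom
```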


4.6 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. What are the critical values for a χ² distribution with df = 8 if you are assuming that α = [0.2, 0.1, 0.01, 0.001]?

2. Create a scatter plot using the variables x <- rnorm(10) and y <- rpois(10,1). Label the axes "Jaw Size" and "Number of Kids".

3. For the probabilities p = seq(0.1,0.9,by=.1) create a graph that has a red line representing the quantile function for the Poisson distribution (qpois with λ = 1) and a blue one representing the quantile function for the χ² distribution (qchisq with df = 1). Make sure to have your axes labeled and drawn properly. Save the image and include it in your answer.

4. In a Platykurtic distribution what is the relationship between the mean, mode, and median?

5. Create a histogram of 1000 random numbers drawn from the F-distribution with parameters df1 = 1 & df2 = 10. On this plot, overlay the density using the density() function. Label the axes appropriately.

6. What is the inner-quartile of the data x <- rnorm(200,3)?

7. Is the data from the command x <- rf(1000,1,10) lepto-, meso-, or platykurtic? How do you know?

8. Explain what is happening with the command data <- LETTERS[rpois(23, 2)]. Create a new variable that is a table of the results of this command, show me the table, and show how you would access the "B" element in the table.

9. What is the range of possible values you can get for a Pearson's Product-Moment Correlation?

10. There is a data set named HWCorrelationData.csv in the folder. Load this data into R, plot it in an appropriate graphic, and then test the hypothesis H_O: Height is independent of Weight.


Chapter 5

Contingency Tables

In this chapter we will examine non-parametric methodologies that are available for the analysis of random variables. It is not uncommon in Biology to encounter the notion that non-parametric approaches are only to be used with categorical (e.g., nominal) data. However, non-parametric analyses are just as applicable to normal ordinal and interval data that we commonly come into contact with and in this Chapter we will go over a few examples of how you can use general non-parametric statistical approaches in your own research.

In this Chapter you will learn the following skills:

• Non-parametric analysis of a single categorical data set (x_1, x_2, ..., x_N) using a χ² test.

• Non-parametric analysis of paired data ((x_1, y_1), (x_2, y_2), ..., (x_N, y_N)) using the Fisher Exact test for small data sets and the general χ² test for large data sets.

• Non-parametric analysis of several random samples using the Kruskal-Wallis test.

For most of the exercises in this chapter you will need the stats library, which you can load by issuing the command library(stats) (a standard R session already has it attached).

5.1 One Random Sample

For this section, we will assume that your data consist of N observations made on a single variable, X = [x_1, x_2, ..., x_N].

5.1.1 Goodness of Fit

The χ² test for goodness of fit is the typical χ² test that we have all had a million times as an undergraduate and a graduate student. The data for this test consists of N observations that can be categorized into K discrete categories. In R we will use the factor data type (see 2.4.10 for more on the factor type).


The assumptions of this test are:

1. All the observations are selected randomly.

2. You can assign an observation to one of the K categories without error.

The test statistic for this analysis is the calculated χ²_Calc, which is:

$$\chi^2_{Calc} = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i} \tag{5.1}$$

where O_i and E_i are the observed and expected counts in the ith category. The underlying distribution of χ²_Calc will be approximated using the χ²-distribution with K − 1 degrees of freedom. From the discussion of this distribution and its depiction in Figure 4.2, it is large values of χ²_Calc that will lead to the rejection of the null hypothesis, H_O.

Example Problem: Assume that we have captured a sample of the Marbled Salamander, Ambystoma opacum, from the Rice Center for Environmental Studies (a field station for Virginia Commonwealth University). On each of these individuals we have classified their marbling pattern as one of Little White (N_A = 24), Moderate White Marbling (N_B = 47), or Mostly White (N_C = 29). A separate crossing experiment has suggested that the marbling on an individual may be under the control of a limited number of genetic loci and has predicted that the frequency of these types would be 1 : 2 : 1 in populations at equilibrium. Do the proposed mechanisms predict the distribution of phenotypes that you sampled from the wild? To test the hypothesis, H_O: Phenotypes occur at a ratio of 1 : 2 : 1, in R we would:

> Phenotypes <- as.factor(c(rep("Little White",24),
+     rep("Marbled",47), rep("Mostly White",29)))
> p <- c(1, 2, 1)
> p <- p / sum(p)  # makes p a vector of probabilities
> table(Phenotypes)
Phenotypes
Little White      Marbled Mostly White
          24           47           29
> chisq.test(table(Phenotypes), p = p)

        Chi-squared test for given probabilities

data:  table(Phenotypes)
X-squared = 0.86, df = 2, p-value = 0.6505

So here, the observed and expected values were relatively close to each other, producing a χ²_Calc (in R called "X-squared") of 0.86, which with df = 2 has a P-value of 0.6505. Not something that would be considered rare. As a result, we fail to reject H_O that the ratio of phenotypes is 1 : 2 : 1.

Here the thing that was passed to the chisq.test function was an object of class table. This is only one way that you can pass data to the chisq.test function. See ?chisq.test for more information on other ways to pass your data to this function.
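The arithmetic behind the salamander example is small enough to check by hand. The sketch below recomputes the expected counts, the test statistic from Eqn. 5.1, and the P-value from the χ² distribution, and then shows that chisq.test() will also accept a plain named vector of counts; the variable names are my own.

```r
observed <- c("Little White" = 24, "Marbled" = 47, "Mostly White" = 29)
p <- c(1, 2, 1) / 4                          # the 1 : 2 : 1 ratio as probabilities

expected <- sum(observed) * p                # expected counts under H_O
chi_calc <- sum((observed - expected)^2 / expected)
p_value  <- pchisq(chi_calc, df = length(observed) - 1, lower.tail = FALSE)

c(X.squared = chi_calc, p.value = p_value)

# The same test, passing a vector of counts rather than a table
res <- chisq.test(observed, p = p)
all.equal(unname(res$statistic), chi_calc)
```

Note the use of lower.tail = FALSE, since it is the area to the right of the calculated value that gives the P-value.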


5.1.2 Binomial Test

The binomial test evaluates the support for the probability (p) that an observation was categorized into one of two groups. The following assumptions are inherent in the binomial test:

1. Each observation can be characterized as either Category A or Category B, and the probability of assigning to A is denoted as p (and to B as 1 − p).

2. The N observations are mutually independent.

The binomial test asks whether the number of items you have classified as Category A is rare given a specified probability, p. The test itself is performed using the binom.test() function. In the example below, I am considering the situation where a coin was flipped 20 times and was found to have shown Heads only six times. The hypothesis is H_O: p = 0.5. The function itself needs a few pieces of data: the number of times Category A was observed (as x), the total number of trials (as n), and the hypothesized probability p. Calling it with these data would be done as:

> binom.test(x=6, n=20, p=0.5)

        Exact binomial test

data:  6 and 20
number of successes = 6, number of trials = 20, p-value = 0.1153
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.1189316 0.5427892
sample estimates:
probability of success
                   0.3

These results suggest that even with only 6 observed Heads in 20 flips, we cannot reject H_O that it is a fair coin. However, the 95% confidence interval shows that there is a large range of values we cannot reject...
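The P-value reported for the coin can be rebuilt from the binomial distribution itself. This is a minimal sketch that leans on the symmetry of the distribution when p = 0.5 (for other values of p the two tails are not mirror images and the doubling shortcut below would not apply).

```r
x <- 6; n <- 20; p <- 0.5

# With p = 0.5 the outcomes at least as extreme as 6 Heads are
# 0 to 6 Heads and, by symmetry, 14 to 20 Heads
p_two_sided <- 2 * pbinom(x, size = n, prob = p)

res <- binom.test(x = x, n = n, p = p)
all.equal(p_two_sided, res$p.value)

res$conf.int      # the 95% confidence interval for the probability of Heads
```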

5.1.3 General Contingency Tables

For this next application of contingency tables we will focus on data describing the diversity of students in the College of Humanities & Sciences at Virginia Commonwealth University. These data are reported by all public institutions and can be found for VCU at the webpage http://www.vcu.edu/cie/analysis/reports/sets.html and are summarized in Table 5.1.

In general, we are going to create a contingency table that has the general form:

          Col 1   Col 2   Col 3   ···   Col c   Totals
Row 1     O11     O12     O13     ···   O1c     R1
Row 2     O21     O22     O23     ···   O2c     R2
  ⋮        ⋮       ⋮       ⋮      ⋱      ⋮       ⋮
Row r     Or1     Or2     Or3     ···   Orc     Rr
Totals    C1      C2      C3      ···   Cc      N


with r rows of data and c columns. Each of the entries in the r x c contingency table (the O_ij values) is a count of the number of observations that were classified as belonging to the category in the ith row and the jth column. Above, when we looked at the χ² test, it was a smaller version of this table, and the test statistic for analyses in general contingency tables is the same as above:

$$\chi^2_{Calc} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

The only distinction here is that our expected values are based upon row and column totals such that:

$$E_{ij} = \frac{R_i C_j}{N}$$

where R_i and C_j are the respective row and column totals.
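The expected values and the test statistic take only a few lines to compute directly. The table of counts below is hypothetical, invented purely for illustration, and the names R, C, N, E, and dof are my own shorthand for the symbols in the equations.

```r
# Hypothetical 2 x 3 table of counts, invented for illustration
Obs <- matrix(c(10, 20, 30,
                20, 25, 15), nrow = 2, byrow = TRUE)

R <- rowSums(Obs)                  # row totals
C <- colSums(Obs)                  # column totals
N <- sum(Obs)                      # grand total

E <- outer(R, C) / N               # E[i, j] = R[i] * C[j] / N
chi_calc <- sum((Obs - E)^2 / E)
dof <- (nrow(Obs) - 1) * (ncol(Obs) - 1)

res <- chisq.test(Obs)
all.equal(E, res$expected, check.attributes = FALSE)
all.equal(chi_calc, unname(res$statistic))
```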

There are two specific assumptions that are required to conduct a general contingency table test such as this:

1. The N observations are drawn randomly from the larger population.

2. Each observation can be classified into exactly one of the possible r and c categories according to single and independent criteria (e.g., there is no correlation between the row and column variables).


Table 5.1: Diversity of enrolled undergraduate students at Virginia Commonwealth University in the College of Humanities & Sciences between the academic years 1998-2008 as reported by the Center for Institutional Effectiveness (http://www.vcu.edu/cie/analysis/reports/sets.html).

Group                               1998   1999   2000   2001   2002   2003   2004   2005   2006   2007   2008
Non-resident Aliens                  186    158    188    208    206    235    272    375    512    577    673
Black non-Hispanic                  2985   3094   3282   3332   3387   3456   3633   3797   3983   4158   4193
American Indian or Alaskan Native     91     80     83     86     90    113    109    116    124    131    131
Asian or Pacific Islander           1103   1139   1132   1175   1231   1437   1632   1764   1970   2148   2330
Hispanic                             279    305    362    400    449    521    559    623    709    761    822
White, non-Hispanic                 8688   8586   9013   9373   9916  10077  10757  11088  11180  11170  11202
Race/ethnicity unknown                 0    188    208    279    387    665    849    928   1019   1287   1642
Total                              13332  13550  14268  14853  15666  16504  17811  18691  19497  20232  20993


To demonstrate this analysis we will analyze the 1998, 2003 and 2008 enrollment data from Table 5.1 to see if the diversity of students at VCU has changed over the last decade. These data are present in a text file named VCUCommonData.csv in the folder for this Chapter. It is loaded into R with the following commands.

> data <- read.table("VCUCommonData.csv", header=T, sep=" ")
> summary(data)
     Yr1998           Yr1999         Yr2000         Yr2001
 Min.   :   0.0   Min.   :  80   Min.   :  83   Min.   :  86.0
 1st Qu.: 138.5   1st Qu.: 173   1st Qu.: 198   1st Qu.: 243.5
 Median : 279.0   Median : 305   Median : 362   Median : 400.0
 Mean   :1904.6   Mean   :1936   Mean   :2038   Mean   :2121.9
 3rd Qu.:2044.0   3rd Qu.:2116   3rd Qu.:2207   3rd Qu.:2253.5
 Max.   :8688.0   Max.   :8586   Max.   :9013   Max.   :9373.0
     Yr2002           Yr2003          Yr2004            Yr2005
 Min.   :  90.0   Min.   :  113   Min.   :  109.0   Min.   :  116
 1st Qu.: 296.5   1st Qu.:  378   1st Qu.:  415.5   1st Qu.:  499
 Median : 449.0   Median :  665   Median :  849.0   Median :  928
 Mean   :2238.0   Mean   : 2358   Mean   : 2544.4   Mean   : 2670
 3rd Qu.:2309.0   3rd Qu.: 2446   3rd Qu.: 2632.5   3rd Qu.: 2780
 Max.   :9916.0   Max.   :10077   Max.   :10757.0   Max.   :11088
     Yr2006            Yr2007          Yr2008
 Min.   :  124.0   Min.   :  131   Min.   :  131.0
 1st Qu.:  610.5   1st Qu.:  669   1st Qu.:  747.5
 Median : 1019.0   Median : 1287   Median : 1642.0
 Mean   : 2785.3   Mean   : 2890   Mean   : 2999.0
 3rd Qu.: 2976.5   3rd Qu.: 3153   3rd Qu.: 3261.5
 Max.   :11180.0   Max.   :11170   Max.   :11202.0

Once the entire data set is loaded into R, we can extract only the values that we are going to use.

> Obs <- as.matrix(cbind(data$Yr1998, data$Yr2003, data$Yr2008))
> Obs
     [,1]  [,2]  [,3]
[1,]  186   235   673
[2,] 2985  3456  4193
[3,]   91   113   131
[4,] 1103  1437  2330
[5,]  279   521   822
[6,] 8688 10077 11202
[7,]    0   665  1642
> colnames(Obs) <- c("1998","2003","2008")
> rownames(Obs) <- c("Non-resident Aliens", "Black non-Hispanic",
+     "American Indian or Alaskan Native", "Asian or Pacific Islander",
+     "Hispanic", "White, non-Hispanic", "Race/ethnicity unknown")
> Obs
                                  1998  2003  2008
Non-resident Aliens                186   235   673
Black non-Hispanic                2985  3456  4193
American Indian or Alaskan Native   91   113   131
Asian or Pacific Islander         1103  1437  2330
Hispanic                           279   521   822
White, non-Hispanic               8688 10077 11202
Race/ethnicity unknown               0   665  1642

With these data we will be specifically testing the hypothesis that across years there are no differences in the relative distributions of self-identified racial and ethnic groups.

In some texts, this (7x3) contingency test is called a χ² Test for Independence, and in R it is conducted using chisq.test(). To begin with, we can plot the categories as a barplot (see 8.2.1 for how to make these plots yourself), as represented in Figure 5.1.


Figure 5.1: Undergraduate diversity at Virginia Commonwealth University during academic years1998, 2003, & 2008.

> test1 <- chisq.test(Obs)
> test1

        Pearson's Chi-squared test

data:  Obs
X-squared = 1704.417, df = 12, p-value < 2.2e-16

> summary(test1)
          Length Class  Mode
statistic  1     -none- numeric
parameter  1     -none- numeric
p.value    1     -none- numeric
method     1     -none- character
data.name  1     -none- character
observed  21     -none- numeric
expected  21     -none- numeric
residuals 21     -none- numeric

Notice here that I actually assigned the results of the statistical test to the variable test1. I did this because there are many reasons why you may be interested in looking at various aspects of the analysis. By printing the contents of the test itself, we see that the calculated statistic is χ²_Calc = 1704.417, which with (r − 1)(c − 1) = 6 × 2 = 12 df produces a very small P-value. If you look back at Figure 4.2, our observed value is way out to the right, with a very small likelihood that you would get a value this large if the null hypothesis were true.

As the output of summary(test1) shows, the analysis itself returns a list that has all the components as list items. There are a lot of different reasons why you may be interested in using various components of the analysis. For example, you may want to create a table of the observed or expected values, or you may need to run this test a large number of times and store the results.
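As a small sketch of how those list items can be pulled out, the block below rebuilds the observed table by hand from the printout above (so it does not depend on the original data file) and then extracts a few components by name. The object name Obs follows the text; everything else is only illustrative.

```r
# Rebuild the observed 7 x 3 table from the printout above.
Obs <- matrix(c( 186,   235,   673,
                2985,  3456,  4193,
                  91,   113,   131,
                1103,  1437,  2330,
                 279,   521,   822,
                8688, 10077, 11202,
                   0,   665,  1642),
              ncol = 3, byrow = TRUE,
              dimnames = list(
                c("Non-resident Aliens", "Black non-Hispanic",
                  "American Indian or Alaskan Native",
                  "Asian or Pacific Islander", "Hispanic",
                  "White, non-Hispanic", "Race/ethnicity unknown"),
                c("1998", "2003", "2008")))

test1 <- chisq.test(Obs)

names(test1)               # every component stored in the result
test1$statistic            # the chi-squared statistic
test1$parameter            # degrees of freedom, (r - 1) * (c - 1)
test1$p.value              # the P-value as a plain number
round(test1$expected, 1)   # expected counts under independence
round(test1$residuals, 2)  # Pearson residuals, (O - E) / sqrt(E)
```

Storing the pieces this way is what makes it practical to collect, say, only the P-values from many repeated tests.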

Caveats

There are some caveats that need to be made with respect to general use of contingency tables. First, they are very robust as long as you have a moderate number of samples in each of the cells. The test statistic we have been using, χ²_Calc with (r − 1)(c − 1) df, is actually an approximation that is good only with good representation. If the values in the cells are small then the approximation that we use to find the Type I error (the α value) is poorly estimated. OK, but what is moderate? Here are some general guidelines:¹

1. If any of the Eij estimates are less than 1 the approximation will be poor.

2. If more than 20% of the Eij values are less than 5 then the approximation will be poor.
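The two guidelines above can be checked directly from the expected values that chisq.test returns. The sketch below uses a made-up 3 × 2 table of small counts (it is not from the VCU data) purely to show the checks.

```r
# Hypothetical table with small counts, for illustration only.
small <- matrix(c(1,  2,
                  3,  9,
                  2, 12), ncol = 2, byrow = TRUE)

# R itself warns that the approximation may be incorrect for a table
# like this; the warning is silenced here so we can inspect the E_ij.
E <- suppressWarnings(chisq.test(small))$expected
E

any(E < 1)          # guideline 1: is any expected value below 1?
mean(E < 5) > 0.20  # guideline 2: are more than 20% of them below 5?
```

If either check comes back TRUE, the remedies discussed next are worth considering.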

So what do you do if you have some small expected values? First, you can try to collapse some of your row or column categories and recalculate. It really depends upon your knowledge of the biology of the system if this can be done without making it a meaningless analysis.

Second, you can try to use Fisher's Exact Test. This uses combinatorial theory to estimate the probabilities of the test statistic rather than asymptotic assumptions. This is an excellent choice but has the problem that, since it uses combinatorial theory, at some point you will have to perform an operation like N!, and when N > 170 the computer cannot calculate a number that large. There is also the restriction that the product of the row marginals (the R_i values in the table) must be strictly less than 2^31 − 1, but the N < 170 rule is a bit easier to remember.
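The overflow mentioned above is easy to see for yourself, and the exact test is run with the fisher.test function in the stats package. The 2 × 2 table below is invented for illustration only.

```r
# 170! still fits in a double-precision number; 171! overflows to Inf.
factorial(170)
suppressWarnings(factorial(171))

# Hypothetical 2 x 2 table with small counts (not data from this chapter).
tbl <- matrix(c(3, 1,
                1, 3), nrow = 2, byrow = TRUE)

ft <- fisher.test(tbl)
ft$p.value   # exact P-value, no chi-squared approximation involved
```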

5.2 Paired Observations

Analyses in this section will be concerned with data that are collected in a pair-wise fashion (e.g., for each observation, there are two values collected).

¹ These guidelines are a bit on the conservative side and you may want to see a text on non-parametric statistics for a more complete discussion of how far you can stray from these and still not get laughed at.


5.2.1 Rank Correlation

In 4.4.2 we looked at how you use the cor.test function to get a parametric estimate of the correlation between two sets of variables. This is possible as well using a non-parametric approach by adopting a ranking methodology. Non-parametric correlation methods include Spearman's ρ and Kendall's τ, among others, but the interface in R is identical (and the same as we already saw for the Pearson product moment correlation), so I will only cover the Spearman approach and leave you to look into the differences.

Spearman’s correlation statistic, ρ, is calculated as:

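\[
\rho = \frac{\sum_{i=1}^{N} R[X_i]\,R[Y_i] - N\left(\frac{N+1}{2}\right)^{2}}
{\left(\sum_{i=1}^{N} R[X_i]^{2} - N\left(\frac{N+1}{2}\right)^{2}\right)^{\frac{1}{2}}
 \left(\sum_{i=1}^{N} R[Y_i]^{2} - N\left(\frac{N+1}{2}\right)^{2}\right)^{\frac{1}{2}}}
\tag{5.3}
\]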

where the term R[Xi] is the rank of the ith element in X. These ranks are computed in comparison to other values in X. For example, R[Xi] = 1 is the smallest value of X, R[Xi] = 2 would be the second smallest, etc. So what is being done here is that we are replacing the actual values of the variables by their relative ranks.

Using the same data as in 4.4.2, you specify the use of the Spearman approach using ranks by passing it as an additional option to the cor.test function.

> X <- 1:20
> Y <- c(-17, 7, -12, 12, -4, 11, 10, -2, 35, 31, 34, 49, 27, 33, 45, 32, 36, 38, 58, 44)
> cor.test(X, Y, method="spearman")

	Spearman's rank correlation rho

data:  X and Y
S = 198, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.8511278

Notice here that the correlation is significant although the correlation statistic is a bit smaller. There is some loss of information by putting the data into ranks rather than using the raw values.
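To connect Equation 5.3 to the output above, here is a short sketch that computes ρ by hand from the ranks and compares it with what R reports. The intermediate names (RX, RY, centre) are mine and not part of any R interface.

```r
X <- 1:20
Y <- c(-17, 7, -12, 12, -4, 11, 10, -2, 35, 31,
       34, 49, 27, 33, 45, 32, 36, 38, 58, 44)

N  <- length(X)
RX <- rank(X)   # R[X_i]
RY <- rank(Y)   # R[Y_i]

# The term N * ((N + 1) / 2)^2 appears three times in Equation 5.3.
centre <- N * ((N + 1) / 2)^2

rho <- (sum(RX * RY) - centre) /
  (sqrt(sum(RX^2) - centre) * sqrt(sum(RY^2) - centre))
rho

# The same quantity from R directly, two ways.
cor(X, Y, method = "spearman")
cor(RX, RY)   # Pearson correlation of the ranks
```

The last line makes the point of the section explicit: Spearman's ρ is simply the Pearson correlation applied to ranks instead of raw values.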

So why use this instead of the parametric approaches? Well, the calculation of Pearson's r statistic depends upon the bivariate distribution of X and Y. If there is no known joint distribution for these variables then the density function of r is undefined. What does this mean to you? It means that if your data can be assumed to be normal then go ahead and use the Pearson approach. However, if you cannot assume that they are normal, or you know they are not, then a rank approach may be more appropriate. For me, I consider the non-parametric approaches as appropriate for all data, whereas the parametric ones are only good for a subset of the data that we encounter.


5.2.2 Wilcoxon Test

The Wilcoxon test is also known as the Mann-Whitney test and is a rank-based method analogous to the two-sample t-test. This approach tests the null hypothesis that samples drawn from two different populations are essentially the same (e.g., they are as likely as samples drawn from one or the other population). Data here are drawn randomly from two different "treatments" to see if the application of either produces a significant shift in the values of one set of observations.

As was discussed for Spearman's ρ, samples will be ranked in increasing order for this analysis. If the ranks in sample X tend to be generally larger or smaller than those of the observations in Y then we can reject the null hypothesis H_O: X = Y. In general your data should look like:

Treatment 1    Treatment 2
X1             Y1
X2             Y2
...            ...
Xn             Ym

In this analysis, we do not assume that both X and Y have the same number of observations and in general will consider X to have n observations while Y has m, and denote N = n + m. Samples are lumped together and assigned ranks based upon the combined N observations. In the case of ties, where two or more samples have the exact same value, it is recommended to assign the average rank to all the tied observations. Fortunately for us, the internal R code takes care of this (and will provide warnings when appropriate) so we can focus on our tasks and let R focus on the specifics.

Assumptions

The Wilcoxon test has the following assumptions:

1. Both sets of samples (the X and Y observations) are drawn randomly from each population.

2. There is an expected mutual independence between the X and Y values as well.

3. The variables are at least ordinal.

The test statistic for this analysis is the sum of the ranks of the X variables:

\[
W = \sum_{i=1}^{n} R[X_i]
\]

If the observations in X and Y are drawn from a single population, as stated in the null hypothesis, then the sum of the ranks of X should be just as large as expected for the sum of the ranks for Y. If the treatments are producing differences in either X or Y then the test statistic will be unusually large (or unusually small) given N.


To show how to conduct the Wilcoxon test, I will use the pine germination data that are in the folder for this Chapter. These data are from my thesis and record the average germination rates for offspring arrays of Pinus echinata families that were sampled in continuous stands (CTRL), selectively cut stands (SEL), and stands where all the trees around P. echinata were clear-cut (CLR). Here we will use the Wilcoxon test to see if there is a significant difference in germination rates between the control (CTRL) and clear-cut (CLR) treatments. Here is how to load the data into R and extract just the treatments of interest.

> pineData <- read.table("PineGerminationData.txt", header=T)
> summary(pineData)
      GERM          TRT
 Min.   :0.0000   CLR :15
 1st Qu.:0.1800   CTRL:23
 Median :0.3700   SEL :15
 Mean   :0.3625
 3rd Qu.:0.5700
 Max.   :0.9400
> X <- pineData$GERM[pineData$TRT=="CLR"]
> Y <- pineData$GERM[pineData$TRT=="CTRL"]
> length(X)
[1] 15
> length(Y)
[1] 23
> X
 [1] 0.67 0.64 0.94 0.40 0.01 0.45 0.58 0.00 0.80 0.81 0.21 0.36 0.82 0.35 0.41
> Y
 [1] 0.63 0.29 0.37 0.56 0.19 0.02 0.06 0.07 0.11 0.18 0.03 0.64 0.21 0.00 0.00
[16] 0.53 0.00 0.00 0.00 0.00 0.35 0.39 0.37
> mean(X)
[1] 0.4966667
> mean(Y)
[1] 0.2173913
> range(X)
[1] 0.00 0.94
> range(Y)
[1] 0.00 0.64

You can see that there are different numbers of samples in each treatment but that they have overlapping ranges. To run the Wilcoxon test, use the function wilcox.test and pass it the two variables.

> wilcox.test(X, Y)

	Wilcoxon rank sum test with continuity correction

data:  X and Y
W = 269.5, p-value = 0.003835
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(X, Y) : cannot compute exact p-value with ties

According to our test, the data in X and Y appear to be different. The test statistic is W = 269.5, which gives a P-value of 0.004. There is a warning message that you should be aware of. The data contain ties, and this prevents R from calculating an exact P-value, so an approximation is used instead. These ties are for families that did not produce any offspring. From a biological perspective, these are valid responses and you would have to just live with the fact that ties existed, because throwing out all the 0.00 values changes the interpretation of what happened.
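One detail worth seeing in code: the W that wilcox.test prints is the rank sum of X with its smallest possible value, n(n + 1)/2, subtracted off, so it is not literally the raw sum in the formula above. The sketch below retypes the X and Y values from the listing so that it runs without the data file; the helper names are mine.

```r
# CLR (X) and CTRL (Y) germination rates, retyped from the listing above.
X <- c(0.67, 0.64, 0.94, 0.40, 0.01, 0.45, 0.58, 0.00,
       0.80, 0.81, 0.21, 0.36, 0.82, 0.35, 0.41)
Y <- c(0.63, 0.29, 0.37, 0.56, 0.19, 0.02, 0.06, 0.07, 0.11, 0.18,
       0.03, 0.64, 0.21, 0.00, 0.00, 0.53, 0.00, 0.00, 0.00, 0.00,
       0.35, 0.39, 0.37)

n <- length(X)

# Rank the pooled observations; rank() gives tied values their average rank.
R <- rank(c(X, Y))

rankSumX <- sum(R[seq_len(n)])      # the raw sum of the ranks of X
W <- rankSumX - n * (n + 1) / 2     # the form that wilcox.test reports
rankSumX
W

# Compare with R; the ties warning is silenced here on purpose.
res <- suppressWarnings(wilcox.test(X, Y))
res$statistic
```

Both versions carry the same information, since n(n + 1)/2 is a constant for a given sample size.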


In general, the Wilcoxon test is rather powerful in determining the equality of samples drawn from two different populations. It is essentially the non-parametric version of the normal t-test.² Situations where you may favor a Wilcoxon approach over the t-test are when you have non-normal data or data with several outlier points.

5.3 Several Random Samples

The final section in this chapter is focused on data that are collected from multiple treatments. In the previous discussion of the Wilcoxon test, the data had k = 2 treatments and it was introduced as a rank-based analog of the t-test. Here we will introduce the Kruskal-Wallis test, which allows for the analysis of k > 2 treatments, and we could again consider it a rank-based analog of an analysis of variance (ANOVA) approach.

5.3.1 Kruskal-Wallis Tests

The Kruskal-Wallis test examines the differences among k different treatments using a rank-based approach similar to that discussed for the Wilcoxon test. In fact, this test is just an extension of the Wilcoxon test using the same sum of ranks approach.

The treatments in this test are not assumed to be of equal size. Each treatment may have a different number of observations in it, with a total sample size of $N = \sum_{i=1}^{k} n_i$. You should be able to make a list of your data by treatment such as:

Treatment 1    Treatment 2    ···    Treatment k
X_{11}         X_{21}         ···    X_{k1}
X_{12}         X_{22}         ···    X_{k2}
...            ...            ...    ...
X_{1n_1}       X_{2n_2}       ···    X_{kn_k}

The test statistic for this test is a χ² approximation with k − 1 degrees of freedom.

Assumptions

There are several assumptions associated with this test:

1. All samples are randomly drawn from their respective treatments.

2. Treatments are independent of each other.

3. The observations are at least ordinal in nature.

As an example using this analysis, we will examine the same Pinus echinata data set that we used to demonstrate the Wilcoxon test. The default method for performing this analysis looks like kruskal.test(x, g, ...) where the variable x is the raw data and g is another variable that has the groupings. In the code below I separate out the variables and then pass them to the function with Germination as the response, grouped by the factor Treatment. I also conduct the analysis and assign it to the variable named germTest so you can see that this analysis also returns a list of results.

² Actually, if you do a t-test on the ranks you will get the same answer as the Wilcoxon; the approaches are identical except for how the data are encoded, raw or as ranks.

> pineData <- read.table("PineGerminationData.txt", header=T)
> GerminationRates <- pineData$GERM
> Treatment <- as.factor(pineData$TRT)
> Treatment
 [1] CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL
[16] CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL SEL  SEL  SEL  SEL  SEL  SEL  SEL
[31] SEL  SEL  SEL  SEL  SEL  SEL  SEL  SEL  CLR  CLR  CLR  CLR  CLR  CLR  CLR
[46] CLR  CLR  CLR  CLR  CLR  CLR  CLR  CLR
Levels: CLR CTRL SEL
> GerminationRates
 [1] 0.630 0.290 0.370 0.560 0.190 0.020 0.060 0.070 0.110 0.180 0.030 0.640
[13] 0.210 0.000 0.000 0.530 0.000 0.000 0.000 0.000 0.350 0.390 0.370 0.580
[25] 0.490 0.450 0.380 0.510 0.570 0.240 0.290 0.620 0.520 0.200 0.240 0.615
[37] 0.760 0.300 0.670 0.640 0.940 0.400 0.010 0.450 0.580 0.000 0.800 0.810
[49] 0.210 0.360 0.820 0.350 0.410
> germTest <- kruskal.test(GerminationRates, Treatment)
> summary(germTest)
          Length Class  Mode
statistic 1      -none- numeric
parameter 1      -none- numeric
p.value   1      -none- numeric
method    1      -none- character
data.name 1      -none- character
> germTest

	Kruskal-Wallis rank sum test

data:  GerminationRates and Treatment
Kruskal-Wallis chi-squared = 12.539, df = 2, p-value = 0.001893

When looking at the results of the test, we see that the estimated test statistic was relatively large, suggesting that the three timber extraction treatments do differentially influence the germination percentages.
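The text does not write out the Kruskal-Wallis statistic, so the following is only a sketch under the usual textbook definition: H = 12/(N(N + 1)) Σ R_i²/n_i − 3(N + 1), where R_i is the rank sum of treatment i, divided by a correction for ties. The data are retyped from the listing above, and the variable names are mine. The result is then referred to a χ² distribution with k − 1 degrees of freedom, as stated earlier.

```r
germ <- c(0.630, 0.290, 0.370, 0.560, 0.190, 0.020, 0.060, 0.070, 0.110,
          0.180, 0.030, 0.640, 0.210, 0.000, 0.000, 0.530, 0.000, 0.000,
          0.000, 0.000, 0.350, 0.390, 0.370,                 # CTRL (23)
          0.580, 0.490, 0.450, 0.380, 0.510, 0.570, 0.240, 0.290, 0.620,
          0.520, 0.200, 0.240, 0.615, 0.760, 0.300,          # SEL (15)
          0.670, 0.640, 0.940, 0.400, 0.010, 0.450, 0.580, 0.000, 0.800,
          0.810, 0.210, 0.360, 0.820, 0.350, 0.410)          # CLR (15)
trt <- factor(rep(c("CTRL", "SEL", "CLR"), times = c(23, 15, 15)))

N  <- length(germ)
k  <- nlevels(trt)
R  <- rank(germ)                 # pooled ranks, ties get average ranks
Ri <- tapply(R, trt, sum)        # rank sum per treatment
ni <- tapply(R, trt, length)     # sample size per treatment

H <- 12 / (N * (N + 1)) * sum(Ri^2 / ni) - 3 * (N + 1)

# Tie correction: t is the number of observations sharing each value.
t <- table(germ)
H <- H / (1 - sum(t^3 - t) / (N^3 - N))
H

pchisq(H, df = k - 1, lower.tail = FALSE)   # approximate P-value

kt <- kruskal.test(germ, trt)               # R's own answer, for comparison
kt$statistic
```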

5.4 The Formula Notation & Box Plots

If you look at the function signature for kruskal.test (by typing ?kruskal.test into R), you can see several alternate ways you can pass your data to it.

kruskal.test             package:stats             R Documentation

Kruskal-Wallis Rank Sum Test

Description:

     Performs a Kruskal-Wallis rank sum test.

Usage:

     kruskal.test(x, ...)

     ## Default S3 method:
     kruskal.test(x, g, ...)

     ## S3 method for class 'formula':
     kruskal.test(formula, data, subset, na.action, ...)


When discussing the relationship between the raw germination data and the grouping variable, I used the statement "...is a function of..." This is the formula notation that is indicated in the last option for calling the kruskal.test function. In R you can often use the formula notation to perform analyses and plots, and here we will spend a little bit of time on how that is done. In Chapter 6 you will use this notation quite a bit when writing out linear models.

The formula notation in R consists of the response variable (or variables, which I'll call Y), the predictor variable (or variables, which will be denoted as X), and the tilde sign showing the relationship. For example, a simple function would be denoted as Y ~ X, stating that Y is a function of X. Using the formula notation for kruskal.test would look like:

> kruskal.test(GerminationRates ~ Treatment)

	Kruskal-Wallis rank sum test

data:  GerminationRates by Treatment
Kruskal-Wallis chi-squared = 12.539, df = 2, p-value = 0.001893

Figure 5.2: Boxplot of Pinus echinata germination data partitioned by timber extraction treatment.


It is even possible (and perhaps better because we are rather lazy in our typing) to use the formula notation with the variable names inside a data.frame without having to make the other variables (GerminationRates and Treatment). However, when you do this, you will have to pass an additional parameter to the analysis function to tell it which data to look into for those variable names. For example, with the pineData data set you can type:

> kruskal.test(GERM ~ TRT, data=pineData)

	Kruskal-Wallis rank sum test

data:  GERM by TRT
Kruskal-Wallis chi-squared = 12.539, df = 2, p-value = 0.001893

Another common place to find the formula notation is in plotting. Thus far, we have called scatter plots by the function plot(x,y). It is just as easy to call the plot as plot(y ~ x) and you will get the same results if the variable x is a continuous variable. However, if x is a categorical variable you will not get a normal scatter plot. What you will get is a box plot as depicted in Figure 5.2, which was created by calling the function:³

> plot(GERM ~ TRT, data=pineData, xlab="Treatment", ylab="GerminationRate")

³ To adjust additional parameters on the box plots see the function bxp, which is the actual plotting function that the plot function is handing the data off to. You can adjust many other components of the plot including notches, box colors, etc.
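One practical wrinkle, offered as a hedged note: since R 4.0.0, read.table no longer converts text columns to factors by default, so a grouping column may arrive as plain character data and the formula plot will not produce a box plot until it is converted. The sketch below uses a handful of values copied from the listing above in a small stand-in data frame (called pine here, which is my name, not the author's).

```r
# A small stand-in for pineData: three values per treatment.
pine <- data.frame(
  GERM = c(0.63, 0.29, 0.37,    # CTRL
           0.58, 0.49, 0.45,    # SEL
           0.67, 0.64, 0.94),   # CLR
  TRT  = rep(c("CTRL", "SEL", "CLR"), each = 3),
  stringsAsFactors = FALSE      # mimic what recent R versions do on input
)

# Convert the grouping column to a factor so that plot() draws boxes.
pine$TRT <- factor(pine$TRT)

plot(GERM ~ TRT, data = pine, xlab = "Treatment", ylab = "Germination Rate")

# boxplot() accepts the same formula and returns the numbers behind the boxes.
b <- boxplot(GERM ~ TRT, data = pine, plot = FALSE)
b$names   # group labels, in factor-level order
b$stats   # whisker ends, hinges, and median for each group
```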


5.5 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• as.factor(x) Coerces the data in x into a factor data type.

• as.matrix(x) Coerces the data in x into a matrix data type if possible.

• binom.test(x,n,p) Performs a binomial test to see if observing x occurrences of one category of data in n trials is consistent with the likelihood of it occurring with a frequency of p.

• c(x,y) The concatenate function that munges all the items together and returns them as a vector.

• cbind(x,y,...) Binds together the data in x, y, etc. by columns.

• colnames(x) Access the column names in the item x. This only works for matrices and data.frames.

• cor.test(x,y) Tests for a significant (i.e., ρ ≠ 0) correlation.

• chisq.test(t) Performs a χ2 test on the values in the table t.

• kruskal.test(x,g) Performs the Kruskal-Wallis Rank Sum test for the data in x as partitioned into groups defined by g.

• length(x) Returns the length of x.

• mean(x) Returns the mean of the items in x.

• range(x) Returns a two-element vector containing the minimum and maximum values in x.

• read.table() Reads raw data into R.

• rownames(x) Access the row names in x. This only works for matrices or data.frames.

• summary(x) Returns a general summary of the data in x.

• table(f) This function takes the list of levels in the factor f and makes a table from it.

• wilcox.test(x,y) Performs the Wilcoxon Rank Sum Test on the variables in x and y.


5.6 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Calculate the relative proportions of each group in the 1999 VCU data and use the goodness of fit approach (as in 5.1.1) to see if the 2008 student class has the same relative proportions as are predicted by the 1999 class.

2. Compare the freshmen enrollment in the College of Humanities & Sciences at VCU (from Table 5.1) during the 2006-2007 academic year for Degree-Seeking Undergraduates to the three Universities listed below. Is the student diversity across these institutions the same? These data sets are prepared each academic year by each public institution and can be found by searching for "Common Data Sets" and looking at Enrollment & Persistence. Below are the places you can get this information for three Universities in our region.

• Auburn University https://oira.auburn.edu/cds/2006/sectionb.aspx

• University of Virginia http://www.web.virginia.edu/IAAS/data catalog/institutional/cds/current/enrollment.htm

• Virginia Tech http://www.ir.vt.edu/common ds 2006.htm

3. Use the wilcox.test function to see if the germination rates observed in the SEL and CLR treatments are significantly different. Provide some interpretation of your results.

4. Load the data into R that are found in the file CornOutput.csv (Note: these data are tab-delimited so you will have to adjust the separator you use in the read.table function). These data represent the output in numbers of bushels per acre of corn with three different fertilizer treatments. Create a density plot showing the distribution of bushels yielded by each treatment.

5. Test the equality of the fertilizers in the data loaded from the last question using a Kruskal-Wallis test. Interpret your results.

6. What are the inner-quartiles of the three fertilizer yields?

7. From a total of N = 15 students in this course, if 14 pass, is the probability of passing this course equal to p = 0.65?

8. What does the optional parameter rescale.p change in the chisq.test function? Why would you want to use this option?

9. Assume that you observed phenotypes in the following amounts: n_spots = 12 individuals with spots, n_silky = 22 with silky fur, n_smooth = 15 smooth coated, and n_agouti = 8 agouti. Do these data fit the hypothesis that the probability of any one of these phenotypes is equal?

10. Create three variables named First, Second, and Third and assign each of them the value of runif(3). Now, create a bar plot of these data assuming that the first entry in each data set represents Category A, the second Category B, and the third Category C. Make it look something like Figure 5.1 with the Categories used as the partitioning variable along the x-axis. Feel free to provide your own colors.


Chapter 6

Linear Models

This chapter focuses on the analysis of linear models in R. The term "linear model" is a general one that will be used a bit loosely. In general, a linear model is one that can be written down in the form:

y = x

Some variable, or set of variables, y, are predicted to have a particular relationship with some predictor variable (or variables) denoted in x. In the simplest case, when both x and y are continuous variables, the analysis is called a regression analysis; if x has more than one predictor variable then it is called a multiple regression, and if y is binary it is a logistic regression. However, if the predictor variable is categorical the model is called an analysis of variance, with many variants depending upon the number and relationship of categorical predictor variables in x. Finally, if the predictor variables consist of categorical and continuous variables then it is called an analysis of covariance. There are many different ways of introducing these different kinds of analysis but we are going to focus on the functional form and the kinds of variables that make up the predictor x.

In this Chapter you will learn the following skills:

• Learn to analyze data using a simple regression approach.

• Be able to incrementally build a multiple regression model using Type III sums of squares.

• Perform an analysis of variance (ANOVA) for both 1-way and factorial models.

6.1 The t-test

6.1.1 One-Sample Tests

The first linear model we will deal with is the t-test. The functional form of this is:


y = µ

where we believe that the observations sampled in y have some particular mean value and the variation around that mean value is simply the natural variation there is in the kind of samples we are measuring. The function that performs the one-sample t-test in R is (not surprisingly) called t.test and has the following options available to it.

t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)

For a one-sample test, we will pass the response variable and a value for the parameter mu to the function. By default, it will test the null hypothesis H_O: y = µ (the mu in the signature) using a "two.sided" alternate hypothesis. This means that we can reject the null if y < µ or if y > µ, using an α/2 rejection region in each tail. If you have reason to believe that the observations should be larger or smaller than some particular value µ, something along the lines of say "the addition of fertilizer should increase yield," then you should be using a one-tailed test instead, which only examines an α-sized region on one end.

In the data below, we are testing the null hypothesis H_O: y = 15 with the given data.

> Y <- c(19,25,14,15,24,17,19,27,29,25)
> test1 <- t.test(Y, mu=15)
> summary(test1)
            Length Class  Mode
statistic   1      -none- numeric
parameter   1      -none- numeric
p.value     1      -none- numeric
conf.int    2      -none- numeric
estimate    1      -none- numeric
null.value  1      -none- numeric
alternative 1      -none- character
method      1      -none- character
data.name   1      -none- character
> print(test1)

	One Sample t-test

data:  Y
t = 3.8523, df = 9, p-value = 0.003892
alternative hypothesis: true mean is not equal to 15
95 percent confidence interval:
 17.64182 25.15818
sample estimates:
mean of x
     21.4

You can see that I assigned the results of the analysis to the variable named test1. Just as in the contingency table examples (5.1.3 & 5.2.2), the results of an analysis are a list containing all the parameters that were used to perform the analysis as well as intermediary materials and results. Of particular mention are the parameters p.value, conf.int, and statistic. Overall, the analysis found that we can reject the null hypothesis H_O: y = 15 with a P-value of ≈ 0.004. This is fairly good support for the notion that the mean of these observations is not equal to 15.
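As a quick sketch of what sits behind those numbers, the block below pulls the named components out of test1, recomputes the t statistic by hand as (mean − µ) divided by the standard error, and then reruns the test as a one-tailed test with alternative="greater". The names tManual and oneSided are mine.

```r
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)

test1 <- t.test(Y, mu = 15)
test1$statistic   # the t value
test1$p.value     # two-sided P-value
test1$conf.int    # 95% confidence interval for the mean

# The same t statistic by hand: (sample mean - mu) / standard error.
tManual <- (mean(Y) - 15) / (sd(Y) / sqrt(length(Y)))
tManual

# One-tailed version: is the mean greater than 15?
oneSided <- t.test(Y, mu = 15, alternative = "greater")
oneSided$p.value  # half the two-sided value, because t is positive here
```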


6.1.2 Paired Tests

The t-test can also be used in a paired fashion. This analysis consists of two sets of variables, X and Y, that are observations taken in such a manner that, under the null hypothesis, the differences between them are expected to be negligible. For example, perhaps you think that parasite load has influenced the development of young warblers so you measure the lengths of the primary feathers. Overall, the null hypothesis for this is H_O: X = Y. Another way to write this hypothesis is H_O: (X − Y) = 0, in which case this becomes identical to the one-sample test. An example of this in R (with entirely contrived data) would be:

> X <- round(runif(10, min=12, max=20))
> Y <- round(runif(10, min=12, max=20))
> X
 [1] 12 18 18 13 14 15 15 16 17 19
> Y
 [1] 14 17 20 13 17 12 16 17 17 15
> t.test(X, Y, paired=T)

	Paired t-test

data:  X and Y
t = -0.1416, df = 9, p-value = 0.8905
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.697808  1.497808
sample estimates:
mean of the differences
                   -0.1

Notice that since these are paired, they must be taken from the same experimental unit, which is why we added the paired=T option to the parameters we passed to t.test.
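The claim that the paired test is the same thing as a one-sample test on the differences can be checked in a couple of lines. Because the example data come from runif, the sketch below retypes the printed X and Y values rather than drawing new random numbers.

```r
# Values copied from the printout above (the originals were random draws).
X <- c(12, 18, 18, 13, 14, 15, 15, 16, 17, 19)
Y <- c(14, 17, 20, 13, 17, 12, 16, 17, 17, 15)

paired    <- t.test(X, Y, paired = TRUE)   # paired t-test
oneSample <- t.test(X - Y, mu = 0)         # one-sample test on X - Y

paired$statistic
oneSample$statistic   # same t value
paired$p.value
oneSample$p.value     # same P-value
```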

6.2 Regression With A Single Variable

A linear regression seeks to see if the values in the response variable y can be predicted to change systematically with the predictor variable x. The general form of a regression model is:

\[
y_i = \beta_0 + \beta_1 x_i + e_i
\]

where the response variable y_i is hypothesized to be a function of three independent components:

1. The intercept, β0.

2. A slope coefficient, β1, that determines at what rate y changes with changes in x.

3. The error term, e_i, which is the latent variation that every observed value has around the predicted regression line.

The methods by which the parameters β0 and β1 are estimated are varied. The most common approach is the least squares approach, which tries to find estimates for these two parameters that minimize the sum of squared error terms (e.g., $\sum_{i=1}^{N} e_i^2$). In R we can use the function lm to construct the linear model. Here is an example data set with the values plotted in Figure 6.1.

Figure 6.1: Plot of single variable regression values.

> X <- 1:10
> X
 [1]  1  2  3  4  5  6  7  8  9 10
> Y <- c(19,25,14,15,24,17,19,27,29,25)
> Y
 [1] 19 25 14 15 24 17 19 27 29 25
> plot(Y~X, xlab="X", ylab="Y", bty="n", col="red", pch=19, ylim=c(0,30), xlim=c(0,10))

To plot these, I used the functional form (see 5.4 for a discussion of how this works) with Y ~ X, set the labels, the plot colors, the ranges of the x- and y-axes, and the plot characters with the pch option.¹ By eye-balling the image, do you think there is a relationship between these variables?

> fit1 <- lm(Y~X)
> fit1

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X
    16.3333       0.9212

¹ To see all the different characters that you can use as plot symbols type plot(1:25,pch=1:25) and it will plot each symbol along the x = y line.

I start by assigning the result of the analysis to the variable fit1. Printing the contents of the analysis shows that the intercept term (the β0) has been estimated to be 16.333 whereas the slope term (R labels this with the name of the predictor variable, here X, and above we called it β1) is 0.92. So for each unit increase in X, there is almost a one-unit increase in Y (OK, since the points do kinda point upwards). But is this significant? You can have a non-zero estimate for a non-significant relationship. To see a slightly more detailed printout of the components in fit1 use the summary function.

> summary(fit1)

Call:
lm(formula = Y ~ X)

Residuals:
   Min     1Q Median     3Q    Max
-5.097 -4.591  0.600  3.238  6.824

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16.3333     3.2258   5.063 0.000973 ***
X             0.9212     0.5199   1.772 0.114348
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.722 on 8 degrees of freedom
Multiple R-squared: 0.2819,	Adjusted R-squared: 0.1921
F-statistic: 3.14 on 1 and 8 DF,  p-value: 0.1143

Here we see several components:

1. The formula that was used to call the lm function.

2. A summary of the residuals (the error terms, e_i).

3. The coefficients themselves with standard errors and probabilities.

4. A summary of the test statistic, F , the df , and the probability.

Overall, it does not appear that the regression line is significant. If you are interested in printing out a more standard ANOVA table for this model, you can pass the variable fit1 to the anova function and it will print out the more normal results.

> anova(fit1)
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value Pr(>F)
X          1  70.012  70.012  3.1398 0.1143
Residuals  8 178.388  22.298

This printout is probably more like what you will be putting into your manuscripts. Again, the trend does not seem to be significant.
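For a single predictor, the least squares estimates have a simple closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of X, and the intercept follows from the two means. The sketch below (b0, b1, and sse are my names) reproduces the coefficients and the residual sum of squares reported by lm and anova above.

```r
X <- 1:10
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)

# Closed-form least squares estimates for one predictor.
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)  # slope
b0 <- mean(Y) - b1 * mean(X)                                     # intercept
c(b0, b1)

# The quantity being minimized: the sum of squared error terms.
sse <- sum((Y - (b0 + b1 * X))^2)
sse

# The same values from lm().
fit1 <- lm(Y ~ X)
coef(fit1)
sum(resid(fit1)^2)
```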


Plotting the Regression Model onto Your Points

It is possible to plot the regression model onto a display of the predictor and response variables. This can sometimes be helpful when visualizing your data. The abline function overlays a line on your current plot. Using the abline function on an existing graph does not require you to call par(new=T) first, as it takes care of that already.

Figure 6.2: Regression model added to plot of points using abline function.

> plot(Y~X, xlab="X", ylab="Y", bty="n", col="red", pch=19, ylim=c(0,30), xlim=c(0,10))
> abline(fit1, lty=2)

In addition to passing a variable that is a regression model (e.g., class(fit1) is "lm"), the function abline can also be called by passing it raw values for the slope and intercept. This means you can add an arbitrary line to any plot you like. As shown above, the function also takes additional parameters that allow you to customize the look of the line. You may want to revisit Table 4.1 as a reminder.
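Here is a brief sketch of those other ways of calling abline. The intercept and slope passed to a and b below are arbitrary numbers picked only for illustration; the h and v arguments draw horizontal and vertical lines.

```r
X <- 1:10
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)
fit1 <- lm(Y ~ X)

plot(Y ~ X, xlab = "X", ylab = "Y", bty = "n", col = "red", pch = 19,
     ylim = c(0, 30), xlim = c(0, 10))

abline(fit1, lty = 2)                   # the fitted regression line
abline(a = 10, b = 1.5, col = "blue")   # arbitrary intercept (a) and slope (b)
abline(h = mean(Y), lty = 3)            # horizontal line at the mean of Y
abline(v = mean(X), lty = 3)            # vertical line at the mean of X
```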


Adding Text To A Graph

While we are customizing this image of our non-significant regression model, it is probably a good time to look at the text() function. This function allows you to add arbitrary text to your plot. The basic call of this function includes the x and y coordinates of where you want to put the text and the character string that you will be putting on the graph.

To illustrate how this is done, we will add the regression formula to the plot. First, we will determine where in the fit1 variable you can find the regression coefficients. You could type out the regression equation yourself, and for a one-off image that may be easier, but since the values are already stored in the fit1 variable, pulling them from there is the more versatile approach.

> names(fit1)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
> fit1$coefficients
(Intercept)           X
 16.3333333   0.9212121
> fit1$coefficients[1]
(Intercept)
   16.33333
> fit1$coefficients[2]
        X
0.9212121

So we can access the values estimated for β0 and β1 using fit1$coefficients[1] and fit1$coefficients[2]. Now we need to make a single string that has the regression equation y = β0 + β1x. The text parts we can write out, but the values should come from fit1. To do this, we use the paste function. This function takes a list of items and mushes them together into a single character string. More can be found on the paste function and general string manipulation in Chapter 9.

> formula <- paste("y = ", fit1$coefficients[1], " + ", fit1$coefficients[2], "x")
> formula
[1] "y = 16.3333333333333 + 0.921212121212122 x"
> text(5, 12.5, formula)
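A pasted equation like this carries far more digits than anyone wants on a figure. One possible way to tidy it, sketched here with made-up data rather than the chapter's fit1, is to round the coefficients before pasting. The two-decimal precision and the text coordinates are arbitrary choices for this example.

```r
# Made-up data for illustration only
x <- 1:10
y <- c(17.2, 18.9, 18.1, 21.5, 20.3, 22.8, 21.9, 24.6, 23.7, 26.1)
fit <- lm(y ~ x)

# Round first, then paste; sep = "" avoids the extra spaces paste adds by default
b <- round(coef(fit), 2)
eqn <- paste("y = ", b[1], " + ", b[2], "x", sep = "")
eqn

# sprintf is an alternative that always prints exactly two decimals
eqn2 <- sprintf("y = %.2f + %.2fx", coef(fit)[1], coef(fit)[2])

plot(y ~ x, bty = "n", pch = 19, col = "red")
abline(fit, lty = 2)
text(3, 25, eqn)   # coordinates picked by eye for these data
```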

6.2.1 Regression Diagnostics

It is possible to attempt to fit any model to a set of data. However, just because R will happily (in most cases) provide you an answer to a model fitting, it does not mean that it is the right model for the data. For example, your data may not be linear, yet it is still possible for you to fit a line to non-linear data. R includes some easy methods that you can use to examine the appropriateness of your model and here we will focus on some of the built-in diagnostics. These focus on the single specified model and allow you to make decisions on the appropriateness of your proposed model. Later in ?? we will cover methods that allow you to determine if one model is better than another for describing your data.


Figure 6.3: Regression model with fitted line and formula.

One of the first things you should do when you specify a linear model is look at the residuals. The residuals are the $e_{ij}$ components of the model in the general formula. These represent the variation that is not explained by your fitted line. The things you are looking for in the residuals are:

1. Systematic changes in the residuals when plotted as a function of the predicted values. This would indicate that there is something else that is changing the response variable that you are not taking into consideration.

2. Non-linearity in the residuals when plotted against the predicted values. This would suggest that perhaps your data are not linear to start with and the fitting of a linear model to it may not be appropriate.

3. Normality of the residuals. These values are expected to be $N(0, \sigma^2)$. If they are not, it may not be appropriate to be fitting this model to your data.

4. Outliers. Do you have any evidence, once you fit your model to the data, that there are particular entries that are obviously not part of the trend? There can be many reasons for outliers. First, a point may simply be an outlier: a real observation that should be kept in the model. However, it is also possible that there was an equipment malfunction, that you entered the data point incorrectly into the computer, etc. It is always good to check and see if you screwed up.

Figure 6.4: A 2x2 matrix plot of some diagnostic tools associated with a linear model. They include a plot of the residuals ($e_{ij}$) as a function of the fitted values ($\hat{y}_i$) to see if there are systematic biases in the model (upper left), a Q-Q plot to examine normality of the residuals (upper right), a scale-location plot (lower left), and a leverage plot to look for outliers (lower right).

R provides a series of four plots for you to look at when you plot a variable specified by lm(). These plots are displayed in Figure ??. You can see these plots by using the command plot(fit1) (or whatever your model variable name is) and R will show you a series of plots examining the distribution of the residuals. For a more in-depth discussion of model verification you should probably consult a textbook on regression analysis.
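The checks listed above can be sketched in a few lines. The data here are made up for illustration, and the Shapiro-Wilk test at the end is just one possible formal check of normality; it is not something this chapter prescribes.

```r
# Made-up data for illustration only
x <- 1:10
y <- c(17.2, 18.9, 18.1, 21.5, 20.3, 22.8, 21.9, 24.6, 23.7, 26.1)
fit <- lm(y ~ x)

# The four default diagnostic plots, arranged in a 2x2 grid
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)   # restore the previous layout

# The residual-versus-fitted plot built by hand
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals", pch = 19)
abline(h = 0, lty = 2)

# One possible formal check on normality of the residuals
shapiro.test(residuals(fit))
```

Look for a shapeless cloud around the dashed zero line in the hand-built plot; a curve or a funnel is the kind of systematic pattern described in items 1 and 2.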

6.3 Multiple Regression

There are several occasions where we may be interested in how well several predictor variables can explain the variation in a response variable. This is called multiple regression and has a linear model with the form:


$$y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + e_i$$

Here you have up to k different predictor variables, each of which contributes to the observed value in y.

The null hypothesis for a multiple regression is $H_0: \beta_i = 0 \;\forall i$, which states that all of the beta regression terms are zero. To address this hypothesis, we build a linear model and then determine how much of the observed variation can be explained by the model.

In R we can use the same lm function as for a single predictor regression, but this time we need to change how we write the formula to accommodate two variables. For this example, we can use the data shown in Table 6.3.

 i       Y     X1    X2
 1    4.26   1.00  0.88
 2   30.74   2.00  0.41
 3   14.95   3.00  0.72
 4   -5.55   4.00  0.19
 5   21.29   5.00  0.40
 6   33.49   6.00  0.37
 7   32.15   7.00  0.61
 8   45.95   8.00  0.09
 9   38.94   9.00  0.74
10   48.27  10.00  0.68

These values can be put into R as:

> Y <- c(4.26, 30.74, 14.95, -5.55, 21.29, 33.49, 32.15, 45.95, 38.94, 48.27)
> X1 <- 1:10
> X2 <- c(0.88, 0.41, 0.72, 0.19, 0.40, 0.37, 0.61, 0.09, 0.74, 0.68)
> cbind(Y, X1, X2)
          Y X1   X2
 [1,]  4.26  1 0.88
 [2,] 30.74  2 0.41
 [3,] 14.95  3 0.72
 [4,] -5.55  4 0.19
 [5,] 21.29  5 0.40
 [6,] 33.49  6 0.37
 [7,] 32.15  7 0.61
 [8,] 45.95  8 0.09
 [9,] 38.94  9 0.74
[10,] 48.27 10 0.68

And then we can create a linear model using the notation lm(Y ~ X1 + X2).

> fit2 <- lm(Y ~ X1 + X2)
> summary(fit2)

Call:
lm(formula = Y ~ X1 + X2)

Residuals:
     Min       1Q   Median       3Q      Max
-24.8394  -2.7430  -0.8989   4.1369  20.0461

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.170     12.801   0.091   0.9297
X1             4.460      1.422   3.137   0.0164 *
X2             1.473     16.763   0.088   0.9324
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.85 on 7 degrees of freedom
Multiple R-squared: 0.5857, Adjusted R-squared: 0.4673
F-statistic: 4.948 on 2 and 7 DF, p-value: 0.04578

> anova(fit2)
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value  Pr(>F)
X1         1 1631.66 1631.66  9.8875 0.01628 *
X2         1    1.27    1.27  0.0077 0.93244
Residuals  7 1155.16  165.02
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, the estimates are β0 = 1.17, β1 = 4.46, and β2 = 1.47. Overall, it appears that the only term that is likely to be nonzero is β1, the coefficient for variable X1. However, even with the β0 and β2 terms in the model (the intercept and the slope coefficient for the variable X2), the overall model is significant (see the F-statistic at the bottom of the summary output, p = 0.046).
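If you want these numbers in variables rather than reading them off the printout, the summary object holds them. The sketch below refits the model from the data given above; the variable names s, f, and p_overall are just names chosen for this example.

```r
# Data as entered in this section
Y  <- c(4.26, 30.74, 14.95, -5.55, 21.29, 33.49, 32.15, 45.95, 38.94, 48.27)
X1 <- 1:10
X2 <- c(0.88, 0.41, 0.72, 0.19, 0.40, 0.37, 0.61, 0.09, 0.74, 0.68)

fit2 <- lm(Y ~ X1 + X2)
s <- summary(fit2)

# Coefficient table: estimate, standard error, t value, and p value per term
coef(s)

# The overall F-statistic is stored as a named vector (value, numdf, dendf)
f <- s$fstatistic
f

# p-value for the overall model, from the upper tail of the F distribution
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
p_overall
```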

Adding Interactions

Sometimes it is preferable to run models that show the interaction between variables as well as the influence of individual variables. This is appropriate when you have some reason to believe that the combination of predictor variables will influence the response in a non-additive manner. The linear model for this is:

$$y_{ij} = \mu + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 X_2) + e_{ij}$$

where the β3 coefficient determines the strength of the interaction. If β3 = 0 then there is no interaction.

In R interaction terms are indicated by the colon operator. For example, the full modelin our example data with the interaction would be specified as

> fit2 <- lm(Y ~ X1 + X2 + X1:X2)
> summary(fit2)

Call:
lm(formula = Y ~ X1 + X2 + X1:X2)

Residuals:
    Min      1Q  Median      3Q     Max
-22.882  -2.267  -1.007   4.168  22.401

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -8.500     26.951  -0.315    0.763
X1             6.204      4.459   1.391    0.213
X2            16.270     39.803   0.409    0.697
X1:X2         -2.732      6.569  -0.416    0.692

Residual standard error: 13.68 on 6 degrees of freedom
Multiple R-squared: 0.5973, Adjusted R-squared: 0.3959
F-statistic: 2.966 on 3 and 6 DF, p-value: 0.1192

> anova(fit2)
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value  Pr(>F)
X1         1 1631.66 1631.66  8.7194 0.02552 *
X2         1    1.27    1.27  0.0068 0.93692
X1:X2      1   32.37   32.37  0.1730 0.69192
Residuals  6 1122.78  187.13
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a shorthand method that indicates that you are interested in having all interactions between predictor variables and that is:

> fit2Alternate <- lm(Y ~ X1*X2)
> summary(fit2Alternate)

Call:
lm(formula = Y ~ X1 * X2)

Residuals:
    Min      1Q  Median      3Q     Max
-22.882  -2.267  -1.007   4.168  22.401

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -8.500     26.951  -0.315    0.763
X1             6.204      4.459   1.391    0.213
X2            16.270     39.803   0.409    0.697
X1:X2         -2.732      6.569  -0.416    0.692

Residual standard error: 13.68 on 6 degrees of freedom
Multiple R-squared: 0.5973, Adjusted R-squared: 0.3959
F-statistic: 2.966 on 3 and 6 DF, p-value: 0.1192

You can see that this gives the exact same result. You should be careful with this notation when you are working with several predictor variables because it will include all the interactions, including the three- and four-way (and higher) ones if you have that many variables. This may or may not be what you are interested in testing.
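You can check what a formula expands to before fitting anything. The sketch below uses a hypothetical third predictor X3 (not part of the example data) and asks R for the term labels; the ^2 form is one way to stop at two-way interactions.

```r
# Formulas can be inspected without any data; X3 is hypothetical here
f_all <- Y ~ X1 * X2 * X3
attr(terms(f_all), "term.labels")
# "X1" "X2" "X3" "X1:X2" "X1:X3" "X2:X3" "X1:X2:X3"

# Main effects plus two-way interactions only
f_two <- Y ~ (X1 + X2 + X3)^2
attr(terms(f_two), "term.labels")
# "X1" "X2" "X3" "X1:X2" "X1:X3" "X2:X3"
```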

Models Without Intercept Terms

Sometimes it is of interest to test the fit of a model that does not have an intercept term. Perhaps you have already subtracted the mean from the response variable ($y' = y - \bar{y}$) and as such there is not predicted to be any intercept, or perhaps, as with our model in the previous section, the data do not support the addition of an intercept term. At any rate, it is possible to indicate to the lm function that you want to run the analysis without estimating the intercept. The linear model for this would be:

$$y_i = \beta_1 X + e_i$$

The formula that you pass to lm is Y ~ X - 1. The -1 in the formula is the part that tells R to leave out the intercept. Running the data again but only including the variable X1 and the response variable Y without the intercept term gives:

> fit3 <- lm(Y ~ X1 - 1)
> summary(fit3)

Call:
lm(formula = Y ~ X1 - 1)

Residuals:
     Min       1Q   Median       3Q      Max
-24.4756  -2.0177   0.1422   4.0652  21.2772

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
X1   4.7314     0.5798    8.16 1.89e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.38 on 9 degrees of freedom
Multiple R-squared: 0.8809, Adjusted R-squared: 0.8677
F-statistic: 66.59 on 1 and 9 DF, p-value: 1.889e-05

> anova(fit3)
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X1         1 8618.7  8618.7  66.587 1.889e-05 ***
Residuals  9 1164.9   129.4
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

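At first glance, this model appears to explain much more of the variation than the full model lm(Y ~ X1 + X2) or the interaction model lm(Y ~ X1*X2) if you just compare the Multiple R-squared values. Be careful with that comparison, though: when the intercept is removed, R computes R-squared relative to zero rather than relative to the mean of Y, so the value is not directly comparable with those from models that include an intercept.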
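To see why the R-squared of a no-intercept model needs care, here is a small sketch that recomputes it by hand from the example data. The names rss3, r2_reported, and r2_centered are just labels for this illustration.

```r
# Data as entered in this section
Y  <- c(4.26, 30.74, 14.95, -5.55, 21.29, 33.49, 32.15, 45.95, 38.94, 48.27)
X1 <- 1:10

fit3 <- lm(Y ~ X1 - 1)   # no intercept
fit4 <- lm(Y ~ X1)       # with intercept

rss3 <- sum(residuals(fit3)^2)

# What summary(fit3) reports: variation measured around zero
r2_reported <- 1 - rss3 / sum(Y^2)

# The same residuals measured against variation around the mean of Y,
# which is the baseline used for models that have an intercept
r2_centered <- 1 - rss3 / sum((Y - mean(Y))^2)

round(c(reported       = r2_reported,
        centered       = r2_centered,
        with_intercept = summary(fit4)$r.squared), 4)
```

On the common baseline the no-intercept model explains about 58% of the variation, essentially the same as the models that keep the intercept, which is why the R-squared values alone should not drive the choice here.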

6.3.1 Comparing Models

So in the previous subsection we have developed four different models that we have proposed to explain our data. They are, in order of decreasing complexity:

• The full model with the interaction terms, lm(Y ~ X1 + X2 + X1:X2).

• The full model without the interaction terms, lm(Y ~ X1 + X2).

• The partial model with only X1, lm(Y ~ X1).

• The minimal model with only X1 and without an intercept term, lm(Y ~ X1 - 1).

There are several methods that you should use to determine which of these models youwould like to consider to be the most appropriate.

1. Look at the overall anova significance. If the overall models are not significant, then there is no use in discussing them. In our examples, the full interaction model was not significant and should be disregarded.

2. Examine the relative significance of each of the terms in the models as shown by the summary function. This can give some indication of which terms may be important. Our various models suggested that the predictor variable X2 did not help in explaining the variation in the response variable.

3. Look at the relative R-squared values. These indicate the proportion of variation explained by the model and are given by the summary function.

4. Use a statistically based method to test the differences between two models, such as:

anova: You can use the anova function and pass it two models that have been fit to the same data and it will perform an analysis to see if the additional term(s) are significant. Here is an example using the models having only the variable X1 to see if the addition of the intercept term is significant (fit4 is the model with the intercept, lm(Y ~ X1)).

> anova(fit3, fit4)
Analysis of Variance Table

Model 1: Y ~ X1 - 1
Model 2: Y ~ X1
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1      9 1164.91
2      8 1156.43  1      8.48 0.0587 0.8147

AIC: There are other statistical methods that you can use to see if the additional terms are significant in your model. One of these is the stepwise method using the AIC (Akaike Information Criterion). In R you can do this by passing the largest model to the function step and it will perform the analysis for you. The AIC statistic will decrease as the estimated predictive power of your model increases, so you want to look for the smallest values of AIC. Here is an example using the full model (including the interaction).

Start:  AIC=55.21
Y ~ X1 + X2 + X1:X2

        Df Sum of Sq     RSS   AIC
- X1:X2  1     32.37 1155.16 53.49
<none>               1122.78 55.21

Step:  AIC=53.49
Y ~ X1 + X2

       Df Sum of Sq     RSS   AIC
- X2    1      1.27 1156.43 51.51
<none>              1155.16 53.49
- X1    1   1624.25 2779.40 60.27

Step:  AIC=51.51
Y ~ X1

       Df Sum of Sq     RSS   AIC
<none>              1156.43 51.51
- X1    1   1631.66 2788.09 58.31

Call:
lm(formula = Y ~ X1)

Coefficients:
(Intercept)           X1
      1.989        4.447

As you can see, the AIC values decrease until the final model, which has only the X1 term and the intercept.

You should consider a wide range of these methods when attempting to put together agood regression model.
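The two statistical comparisons described in item 4 can be reproduced from the example data as sketched below. The name fit4 follows the anova printout above; cmp, full, and best are names chosen for this illustration.

```r
# Data as entered in this section
Y  <- c(4.26, 30.74, 14.95, -5.55, 21.29, 33.49, 32.15, 45.95, 38.94, 48.27)
X1 <- 1:10
X2 <- c(0.88, 0.41, 0.72, 0.19, 0.40, 0.37, 0.61, 0.09, 0.74, 0.68)

# Two nested models: without and with the intercept
fit3 <- lm(Y ~ X1 - 1)
fit4 <- lm(Y ~ X1)

# F test for the extra term (here, the intercept)
cmp <- anova(fit3, fit4)
cmp

# Backward stepwise selection by AIC, starting from the largest model.
# step() prints each step as it goes and returns the selected model.
full <- lm(Y ~ X1 * X2)
best <- step(full)
formula(best)
```

The object returned by step is an ordinary lm fit, so summary(best) and anova(best) work on it as usual.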

6.4 Analysis of Variance

The analysis of variance is a common method for examining the equality of observations that can be partitioned into categorical treatments. In reality, an ANOVA is simply a regression with categorical predictor variables (i.e., the values of x are not continuous).
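The claim that an ANOVA is a regression with a categorical predictor can be checked directly. The sketch below uses simulated data (the treatment labels and means are invented for this example) and fits the same model with lm and with aov.

```r
# Simulated data for illustration only
set.seed(42)
trt  <- factor(rep(c("CTRL", "SEL", "CLR"), each = 10))
germ <- c(rnorm(10, mean = 0.40, sd = 0.1),
          rnorm(10, mean = 0.60, sd = 0.1),
          rnorm(10, mean = 0.65, sd = 0.1))

fit_lm  <- lm(germ ~ trt)    # regression on a factor
fit_aov <- aov(germ ~ trt)   # the same model through aov

anova(fit_lm)                # identical table to the one below
summary(fit_aov)

# Behind the scenes the factor is coded as 0/1 indicator columns
head(model.matrix(fit_lm))
```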

6.4.1 1-Way ANOVA

The simplest ANOVA model is one in which a single treatment has been applied and you have collected a single set of observations. The linear model can be presented as:

$$y_{ij} = \mu + \tau_i + e_{ij}$$

where $\tau_i$ is the treatment effect. You can think of this as the deviation from the overall mean that can be attributed to an observation being in a particular treatment. The $e_{ij}$ term is again the error term.

In 5.3.1, we used the Pinus echinata germination data to illustrate how to perform a Kruskal-Wallis test. At that time, I had suggested that the Kruskal-Wallis test was a rank-based version of an analysis of variance (ANOVA). Here we will use the same data again to demonstrate the parametric equivalent of the Kruskal-Wallis test: the one-way ANOVA.

As a reminder, the data consist of family germination rates for Pinus echinata (perhaps one of the homeliest looking conifers in existence) separated by timber treatment. In the Ozark mountains of Missouri, control, selectively cut, and clear cut treatments were applied to previously continuous forest stands. No P. echinata individuals were removed, so in essence the treatments were modifications of other species around the resident pines. A summary of germination data is presented in Figure ?? showing the average germination rate lowest in the control stands and highest in the stands where heterospecifics were selectively removed from around the target species.

The null hypothesis for this model is $H_0$: no treatment effects (which is like saying $\tau_{Control} = \tau_{Selective} = \tau_{ClearCut}$).

> pineData <- read.table("PineGerminationData.txt", header=T)
> anova1 <- aov(GERM ~ TRT, data=pineData)
> anova1
Call:
   aov(formula = GERM ~ TRT, data = pineData)

Terms:
                      TRT Residuals
Sum of Squares  0.8717943 2.6520868
Deg. of Freedom         2        50

Residual standard error: 0.2303079
Estimated effects may be unbalanced
> anova(anova1)
Analysis of Variance Table

Response: GERM
          Df  Sum Sq Mean Sq F value    Pr(>F)
TRT        2 0.87179 0.43590   8.218 0.0008207 ***
Residuals 50 2.65209 0.05304
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure 6.5: Boxplot of germination percentages for Pinus echinata as a function of treatment. A colored rug was added to the right side to show the actual values within treatments (see rug).

From these results, we can see that there is a treatment effect, and it appears to be highly significant. But in looking at the plot in Figure 6.5, are these results supposed to lead us to believe that all the treatments are significantly different or just some subset of them?

One way to get at this is to look at 95% confidence intervals for the differences between treatment means and see whether they include zero. One way to do this is to use the Tukey Honest Significant Differences (or TukeyHSD) function. This function takes the aov analysis as an argument and prints out the confidence intervals for the differences in the means of the treatments.

Figure 6.6: Confidence intervals for difference in mean germination rates for Pinus echinata families.

> postHoc <- TukeyHSD(anova1)
> postHoc
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = GERM ~ TRT, data = pineData)

$TRT
                diff         lwr         upr     p adj
CTRL-CLR -0.27927536 -0.46389755 -0.09465318 0.0017640
SEL-CLR  -0.04566667 -0.24879523  0.15746190 0.8504882
SEL-CTRL  0.23360870  0.04898651  0.41823088 0.0098768

> plot(postHoc)


The postHoc analysis can also be plotted by calling plot(postHoc), which shows the confidence intervals for the differences between treatment levels (those that overlap zero are not significantly different) as presented in Figure 6.6. These results suggest that the significance in the ANOVA model is due to the differences between the control and the other two treatments, and that both of the cutting treatments had essentially the same germination rate (just larger than families in the control stands).
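The TukeyHSD result is a list holding one matrix per factor, so the comparisons can be pulled out programmatically instead of read off the printout. The sketch below uses simulated data with made-up group labels and means, not the pine germination data.

```r
# Simulated data for illustration only
set.seed(1)
trt   <- factor(rep(c("A", "B", "C"), each = 12))
means <- c(A = 10, B = 10, C = 13)
resp  <- rnorm(36, mean = means[as.character(trt)], sd = 1)

fit <- aov(resp ~ trt)
ph  <- TukeyHSD(fit)

# One matrix per factor, with columns diff, lwr, upr, and "p adj"
tab <- ph$trt
tab

# Comparisons whose adjusted p-value is below 0.05
sig <- tab[tab[, "p adj"] < 0.05, , drop = FALSE]
sig

# The same information read from the intervals: does the 95% interval exclude zero?
excludes_zero <- tab[, "lwr"] > 0 | tab[, "upr"] < 0
excludes_zero
```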


6.5 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• abline(x) Draws a line on the currently active graphics device. You can either specify the intercept and slope or pass this a fitted linear model.

• anova(x) Creates the Analysis of Variance Tables for the models passed in x.

• aov(x) Performs the analysis of variance on the formula in x.

• cbind(x,y,...) Puts the variables x, y, ... into a single column-bound variable.

• lm(func) Fits the model func using linear least-squares.

• t.test(x) This function performs the t-test for either a single data set and a predicted mean or a paired t-test using two data sets.

• round(x) Rounds the value of x to the nearest integer.

• pch Optional parameter for the plot function that designates the type of symbol plotted.

• runif(n,mn,mx) Returns n random numbers drawn uniformly from the range [mn,mx].

• step(x) Evaluates the terms in the model x for inclusion in the model using the AIC criterion.

• summary(x) Provides description of x.

• text(x,y,c) Plots the text in c on the graph at the coordinates (x, y).

• TukeyHSD(x) Performs Tukey's Honest Significant Difference post-hoc test on the aov model in x.


6.6 Exercises

The following exercises are meant to help you understand the items presented in this chapter.

1. Load the data set Temperature.csv from the chapter folder. These data represent the measured brood chamber temperature for a wood-boring beetle. Test the hypothesis $H_0$: mean temperature is 61.

2. Load the data set ClutchSizes.csv from the file. Using a paired t-test, test the hypothesis $H_0$: there is no difference in reproductive output between habitat types.

3. Load the data file SingleRegresssion.RData from the file into R (use the load function). Fit the regression model, Y ~ X. Is it significant? Show the regression equation and the anova table.

4. Plot the regression model from the previous example and indicate the fitted regression line with a dotted red line in the plot.

5. From the single regression model, add the regression equation to the graph indicating the β coefficients that were estimated.

6. Does a plot of the residuals as a function of the predicted values from the estimated regression model suggest that the model is appropriate?

7. Load the data set MultipleRegression.RData from the file; it will contain a data frame named multReg. Use the variables in this data frame, Y, X1, X2, X3, to fit a multiple regression model. Show the summary and the anova table in your results. What is the predicted regression equation?

8. Fit another model to the multReg data that has all the interaction terms amongst the X predictor variables. Use the anova procedure to see which of these models is more appropriate.

9. Load the data file VarroaCounts.RData; it will be a data frame named BeeData. These data represent counts of the parasite Varroa destructor, a common pest of domesticated honey bees. Test the hypothesis using an analysis of variance that there is no difference in mite counts between the different lines of bees.

10. Perform the TukeyHSD test on the parasite data from the previous question.


Chapter 7

Working With Images

In this chapter, you will focus on the following topics:

• Gain a basic understanding of open image formats

• Learn how to import image data into R

• Manipulate image data at the pixel level.

7.1 Image Data

There are several different methods available to you to import image data into R. As I was writing this document over Winter break and updating it in the fall, the main image processing library for R, rimage, was broken and caused a few problems when installed. I am sure it will be fixed in the near future and recommend that you look at that library when you next need to do some image manipulation because it has a lot of functionality. However, at present, it is not going to be used. The consequence of not having rimage is that importing jpeg, tiff, and bmp image formats appears to be beyond our grasp. Lucky for us, there are a ton of other image formats out there and we can easily convert the image shown in Figure 11.1 into another format and use it just as easily. Perhaps when I update this manuscript the next time around, I'll change this section. I think it is also important that you understand the internal workings of images, and for right now these simpler image formats will serve our purposes nicely; everything you learn here will be easily transferable to those other image formats when you need to deal with them in the future.

7.1.1 PNM Image Format

Images on computers have specific formats in which the color information and other metadata are stored in the file. Some of the formats are relatively easy to use and can be manipulated directly in a text editor. Others are more of a pain and some are "owned" by some company who has patented the way the information is stored in the file and you have to pay royalties to them to view it. For example, the ubiquitous GIF image format uses an algorithm that was patented and owned by a company and if you were to write a viewer for it in some countries you would have to pay a royalty to use it... Lame.

The PNM image format (short for portable anymap) is an open format for the exchange of image information. Actually, there are three different formats that fall under the PNM specification as detailed below.

Portable Bitmap Format (PBM)

This format stores bitmap images. A bitmap can be thought of as an image whose pixels are either turned on or off (say black and white). The representation of a PBM file can be given as a simple text file with the extension .pbm. An example text file for a bitmap file that encodes the uppercase letter R would be:

P1
# This is an example bit map file r.pbm
5 8
1 1 1 1 0
1 0 0 0 1
1 0 0 0 1
1 0 0 0 1
1 1 1 1 0
1 0 0 1 0
1 0 0 0 1
1 0 0 0 1

In this file, the first line is a special code (the "magic number") that identifies the file type; P1 marks a plain-text bitmap with one bit per pixel. The second line is a comment line that you can put anything you like into (but it has to start with the # character). The third line tells how many columns and rows of data the image has. Note the order: the first number is the number of columns (the width) and the second is the number of rows (the height), which is the opposite of the rows-then-columns order we use in R for interacting with matrices of data. The rest of the file consists of the actual bit matrix where 1 represents a pixel that is turned on and 0 represents a pixel that is turned off. The image represented in this file is given in Figure 7.1.
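Because the format is plain text, you can write and read such a file with nothing but base R. The sketch below writes the letter R bitmap to a temporary file and parses it back; the temporary path and the variable names are choices made for this example, and the parser only handles the simple layout shown above (one comment line, one image row per line).

```r
# The 8 x 5 bit matrix for the letter R, entered row by row
r_bits <- matrix(c(1, 1, 1, 1, 0,
                   1, 0, 0, 0, 1,
                   1, 0, 0, 0, 1,
                   1, 0, 0, 0, 1,
                   1, 1, 1, 1, 0,
                   1, 0, 0, 1, 0,
                   1, 0, 0, 0, 1,
                   1, 0, 0, 0, 1),
                 nrow = 8, ncol = 5, byrow = TRUE)

# Write the file: magic number, comment, "columns rows", then the bits
pbm_file <- file.path(tempdir(), "r.pbm")
header <- c("P1",
            "# This is an example bit map file r.pbm",
            paste(ncol(r_bits), nrow(r_bits)))
rows <- apply(r_bits, 1, paste, collapse = " ")
writeLines(c(header, rows), pbm_file)

# Read it back: drop comments, take the dimensions, then the pixel values
lines <- readLines(pbm_file)
lines <- lines[!grepl("^#", lines)]
dims  <- as.integer(strsplit(lines[2], " ")[[1]])   # columns first, then rows
vals  <- scan(text = paste(lines[-(1:2)], collapse = " "), quiet = TRUE)
bits  <- matrix(vals, nrow = dims[2], ncol = dims[1], byrow = TRUE)
bits
```

Note the byrow = TRUE when rebuilding the matrix: the file lists pixels across each row, while R fills matrices down the columns by default.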

You can make this image programmatically by creating the matrix in R and using the image function. Here is an example creating the image of the letter T.

> x <- matrix(0, nrow=8, ncol=5)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0    0
[4,]    0    0    0    0    0
[5,]    0    0    0    0    0
[6,]    0    0    0    0    0
[7,]    0    0    0    0    0
[8,]    0    0    0    0    0
> x[1,] <- 1
> x[,3] <- 1
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    0    0    1    0    0
[3,]    0    0    1    0    0
[4,]    0    0    1    0    0
[5,]    0    0    1    0    0
[6,]    0    0    1    0    0
[7,]    0    0    1    0    0
[8,]    0    0    1    0    0
> colors <- c("black", "grey")
> image(x, col=colors, axes=F)

Figure 7.1: The image represented in the r.pbm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org).

Here I created a matrix that had all 0 in it and set the top row and the middle column equal to 1. Then the image function was used to plot it. The image function takes a number of optional arguments and here I have supplied it the colors and the option to not show the axes. Since I have two values in the matrix, a two-element vector will be sufficient to handle all the different colors. Figure 7.2 shows this matrix. There seems to be a small problem with it in that it is rotated 90 degrees counter-clockwise. This is because the origin of the plot that is created by the image function is in the lower left-hand corner. Conversely, most images that are stored on the computer (like the desktop image in the background) assume that the origin is at the upper left-hand corner of the image. Obviously these two do not mesh well together.
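One common way around the rotation, sketched here as a suggestion rather than as part of the original example, is to reverse the row order and transpose the matrix before handing it to image. The name upright is just a label for this illustration.

```r
# The letter T matrix from above
x <- matrix(0, nrow = 8, ncol = 5)
x[1, ] <- 1
x[, 3] <- 1

# image() puts the first index along the x axis and counts y from the bottom,
# so flip the rows and transpose to get the matrix drawn as it is printed
upright <- t(x[nrow(x):1, ])

image(upright, col = c("black", "grey"), axes = FALSE)
```

After the transformation the top row of the original matrix ends up in the last column of upright, which image draws along the top edge of the plot.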

Portable Graymap Format (PGM)

This format is for graymap images where the term graymap refers to the lack of color in the image. In terms of complexity, there is slightly more information contained in the data file as each pixel is not either ON or OFF; rather there is a percentage of ONNESS... (is that a word?).

Figure 7.2: A PBM file that was programmatically created in R. The image is rotated because of the default location of the origin.

P2
# The PGM file for dog.pgm
24 7
5
0 1 1 1 1 0 0 0 0 0 0 5 5 5 0 0 0 0 0 4 4 4 0 0
0 1 0 0 0 1 0 0 0 0 5 0 0 0 5 0 0 0 4 0 0 0 4 0
0 1 0 0 0 0 1 0 0 5 0 0 0 0 0 5 0 0 4 0 0 0 0 0
0 1 0 0 0 0 1 0 0 5 0 0 0 0 0 5 0 0 4 0 0 0 0 0
0 1 0 0 0 0 1 0 0 5 0 0 0 0 0 5 0 0 4 0 0 4 4 0
0 1 0 0 0 1 0 0 0 0 5 0 0 0 5 0 0 0 4 0 0 0 4 0
0 1 1 1 1 0 0 0 0 0 0 5 5 5 0 0 0 0 0 4 4 4 0 0

The first three lines of the file are the same as for the PBM format (apart from the code, which is now P2). The fourth line in the file gives the maximum value, representing the most white in the image. In this case, a black pixel will be represented by the number 0, white would be represented by 5, and values in between would be 1/5 increments of whiteness. The remaining portions of the file have the actual image represented in a pixel-by-pixel matrix of values. You can see that the majority of the image is 0/5 (black) and the letters are varying shades of gray (Figure 7.3).

The number of shades of gray you use in a PGM file is up to you as long as the maximum value stays below 65536 (the original version of the format capped it at 255). These are easy files to create and you could imagine how you could create a matrix of integers from some analysis and save it as a pgm file and view it directly.

Figure 7.3: The image represented by the dog.pgm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org).
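Saving a matrix of integers as a PGM file takes only a few lines of base R. The helper below, write_pgm, is a hypothetical function written for this illustration (it is not part of any package), and the gradient matrix simply stands in for the output of some analysis.

```r
# Hypothetical helper: write an integer matrix as a plain-text (P2) PGM file
write_pgm <- function(m, file, maxval = max(m), comment = "written from R") {
  stopifnot(is.matrix(m), all(m >= 0), all(m <= maxval), all(m == round(m)))
  header <- c("P2",
              paste("#", comment),
              paste(ncol(m), nrow(m)),      # columns first, then rows
              as.character(maxval))
  rows <- apply(m, 1, paste, collapse = " ")  # one image row per line
  writeLines(c(header, rows), file)
  invisible(file)
}

# A made-up 7 x 24 matrix of gray levels between 0 and 5
m <- outer(1:7, 1:24, function(r, c) (r + c) %% 6)

pgm_file <- file.path(tempdir(), "gradient.pgm")
write_pgm(m, pgm_file, maxval = 5)

# Peek at the first few lines of the file
head(readLines(pgm_file), 5)
```

The resulting file can be opened in any viewer that understands PNM images, such as GIMP.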

Portable Pixmap Format (PPM)

The last file format, PPM, is one that handles pixmaps, which means that you have colored pixels in the image. The file format is identical to that of the PGM with the exception that the code on the first line is P3 and each pixel is described by three values: one for red, one for green, and one for blue. With a maximum value of 255, that is 8 bits per channel, or 24 bits per pixel. An example of the PPM file shown in Figure 7.4 is:

P3
# This image contains an image of my daughter Libbie (from Libbie.ppm).
180 240
255
188
219
253
189
220
252

In this file, the pixel values are placed one per line instead of next to each other. Starting at line number 5 with a value of 188, the following 180 x 240 x 3 = 129,600 lines each contain an integer whose value is between 0 and 255 (the maximum value given on line 4). The values come in groups of three, giving the red, green, and blue intensity of one pixel before moving on to the next pixel, so the first pixel here is (188, 219, 253) and the second is (189, 220, 252). That works out to 43,200 values for each color. When we begin looking at manipulating images you will find that, once the image is loaded into R, you can interact with each color channel independently.
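To make the layout concrete, here is a sketch that parses a tiny plain-text PPM by hand and splits it into channels. The 2 x 2 image is invented for the example, and the code assumes the red, green, blue ordering per pixel used by the PPM format.

```r
# A made-up 2 x 2 image: red, green on the top row; blue, white on the bottom
ppm_lines <- c("P3",
               "# A 2x2 example image",
               "2 2",
               "255",
               "255 0 0",     "0 255 0",
               "0 0 255",     "255 255 255")

body   <- ppm_lines[!grepl("^#", ppm_lines)]        # drop the comment
dims   <- as.integer(strsplit(body[2], " ")[[1]])   # columns, rows
maxval <- as.integer(body[3])
vals   <- scan(text = paste(body[-(1:3)], collapse = " "), quiet = TRUE)

# One row per pixel, one column per color
triplets <- matrix(vals, ncol = 3, byrow = TRUE)

# Reshape each color into an image-shaped matrix scaled to [0, 1]
to_channel <- function(v) {
  matrix(v, nrow = dims[2], ncol = dims[1], byrow = TRUE) / maxval
}
red   <- to_channel(triplets[, 1])
green <- to_channel(triplets[, 2])
blue  <- to_channel(triplets[, 3])

red
```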

One drawback to these image formats is that they are not very efficient. For example, the image of my daughter in Figure 7.4 has 129,604 lines of information in it, which on my computer makes it 465K in size. The exact same image saved as a jpeg file is only 25K in size. The compression used to make jpeg, tiff, gif, png, and other compressed file formats is why they are used on the internet. But for our purposes, the lack of compression and the inefficiency in storage size are relatively irrelevant.


Figure 7.4: The image represented in the Libbie.ppm file. This image has been scaled up to makeit large enough to see it on the page using the program GIMP (www.gimp.org).

7.2 Loading The Image Into R

OK, now that we have covered the basics of how one kind of image is represented in the data files, it is time to load one into R and see what we have to work with. To load a PNM file, you must first load the pixmap library; then you can use the function read.pnm() to load the file into a local variable and plot it using the plot() function.

> library(pixmap)
> photo <- read.pnm(file="Libbie.ppm")
Read 129600 items
> plot(photo)

The plot() function will open a new image window and show the loaded image.

7.3 Components of A Pixmap

We can learn a little bit more about what kind of data type the variable we call photo is by using the class() function.

> class(photo)
[1] "pixmapRGB"
attr(,"package")
[1] "pixmap"
> names(attributes(photo))
[1] "size"     "cellres"  "bbox"     "bbcent"   "channels" "red"      "green"
[8] "blue"     "class"


This variable is a pixmapRGB class that comes from the pixmap package. A class is a self-contained data structure that has both attributes and data. The command names(attributes(photo)) tells us the names of the attributes that the variable has.

There are some issues that we should touch on when dealing with classes. They differ from what we have been using thus far, such as data frames, in that we cannot access the contents of a class using the $ notation. This is because things like lists and data frames are not classes in this formal sense; they are just objects. To access the attributes of classes we use the @ notation. For example:

> photo@size
[1] 240 180
> photo@channels
[1] "red"   "green" "blue"
> dim(photo@red)
[1] 240 180
> photo@red[1,1]
[1] 0.7372549
> range(photo@red)
[1] 0 1

Here we can get to the size, channels, and red components of the class directly. We can also see that the red channel, which determines the amount of redness in each pixel, has been standardized on the range [0, 1]. This is important to know if we are going to manipulate the image directly.
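As a rough check on where the value 0.7372549 comes from, the sketch below assumes (as the numbers suggest) that each raw value in the file is simply divided by the maximum value given on line 4 of the file. The variable names are illustrative only.

```r
# First three raw values from the Libbie.ppm listing and the maximum from line 4
raw      <- c(188, 219, 253)
maxValue <- 255

# Rescale the raw 0-255 values onto the range [0, 1]
scaled <- raw / maxValue
round(scaled[1], 7)   # 0.7372549, the value reported for photo@red[1,1]
```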

7.4 Image Operations

7.4.1 Extracting Channels

So now we know enough to make some alterations to the image and see what happens. In the next example, I first copy the photo to make three additional photos, named redPhoto, bluePhoto, and greenPhoto. Then for each of the new variables I remove all the data in the other two channels by making each of those channels contain a matrix of zeros the same size as the original matrix.

> redPhoto <- photo
> bluePhoto <- photo
> greenPhoto <- photo
> redPhoto@size
[1] 240 180
> redPhoto@blue <- redPhoto@green <- matrix(0,nrow=240,ncol=180)
> bluePhoto@red <- bluePhoto@green <- matrix(0,nrow=240,ncol=180)
> greenPhoto@red <- greenPhoto@blue <- matrix(0,nrow=240,ncol=180)
> par(mfrow=c(1,4))
> plot(photo)
> plot(redPhoto)
> plot(greenPhoto)
> plot(bluePhoto)

Note that I used the sequential assignment A <- B <- C <- D as a shorthand here. This will assign the value of D to the variable C, then C to B, and then B to A. This is a lazy trick but one that you will probably use as it saves a bit of time and typing.
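A small sketch of the sequential assignment on its own, using made-up variable names, shows that each variable ends up with its own copy of the value:

```r
# Assign the same 2 x 2 matrix of zeros to three variables in one statement
a <- b <- d <- matrix(0, nrow = 2, ncol = 2)

identical(a, d)   # TRUE: all three start out with the same contents

# Changing one copy does not change the others
a[1, 1] <- 1
b[1, 1]           # still 0
```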


Then I make a 1x4 matrix of plots so that I can plot all four images in the same frame (see ?? for more on how this is done) and in each of the four slots, I plot one of the images, yielding a figure similar to what is presented in Figure 7.5.

Figure 7.5: The original image along with ones where only the red, green, or blue channel is turned on.

In some cases, it is helpful if you can extract the color information and generalize the image as a greyscale image (as you will in Chapter 11). Here we use the information from each channel, weighted equally, in the creation of the image.

> gphoto <- pixmapGrey(photo@red+photo@blue+photo@green)
> plot(gphoto)
> names(attributes(gphoto))
[1] "size"     "cellres"  "bbox"     "bbcent"   "channels" "grey"     "class"
> gphoto@grey[1,1]
[1] 0.8627451
> range(gphoto@grey)
[1] 0 1

The function pixmapGrey() takes a matrix of data, for which we just use the element-wise addition of each channel in the color photo. You can also see that in the creation of the new grey image, the values were again standardized.

For the moment, let's examine the contents of this grey image and play around with it a bit. Let's make it a bit darker by shifting all the grey values down (to make it more black). We can do this by performing operations on the matrix of grey values in the class. For simplicity, I will make a copy of the image first and then perform operations on the copy rather than the original one. Then we will look at the distribution of grey values that make up the image.

> darkerGphoto <- gphoto
> darkerGphoto@grey <- darkerGphoto@grey / 2
> par(mfrow=c(1,3))
> plot(gphoto)
> hist(gphoto@grey, xlim=c(0,1), xlab="Grey", main="")
> plot(darkerGphoto)

We can see that the vast majority of values are towards the light end of the distribution. To darken this up, we scale these values to be closer to zero by dividing them by 2 and then replot the image to see the result (see results in Figure 7.6).


7.5 Creating Images Programmatically

Images can be made programmatically once you understand how images are represented. There are some helper functions that can help you in creating new images. For the purposes of this section, we will focus on greyscale images and leave the analysis of colored images for you to play with on your own time.

Let's start by making an image where each pixel is randomly assigned a greyscale value. For convenience, I'll make it the same size as the photo named gphoto from 7.4.1.

> randomImageMatrix <- matrix(rnorm(240*180),nrow=240,ncol=180)
> gray <- grey(1:100/100)
> image(randomImageMatrix, col=gray)

Here I use the rnorm() function to create 240 * 180 = 43,200 random numbers in a matrix that has 240 rows and 180 columns. I then use the grey() function to create 100 different shades of grey ranging from nearly black to white at equal intervals. When the image is made, the range of random numbers is used to divide the pixels into the 100 different grey colors (i.e., the image() function scales the values in randomImageMatrix into length(gray) distinct groups for plotting). The result is shown in Figure 7.7.

This image can be manipulated by changing the values in the matrix randomImageMatrix. In the next example, I replace a block of roughly 40x40 pixels near the center with white (which corresponds to the largest value in randomImageMatrix).

> randomImageMatrix[100:140,70:110] <- max(randomImageMatrix)
> image(randomImageMatrix, col=gray)

The result is shown in Figure 7.8 resembling a square doughnut (mmmm doughnuts...).
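The block replacement can be tried without plotting anything. The sketch below uses an arbitrary random seed (my own addition, so the numbers are repeatable) and checks the size of the block; note that a range such as 100:140 includes both endpoints, so it covers 41 rows rather than 40.

```r
set.seed(1)   # arbitrary seed, only so that the example is repeatable
randomImageMatrix <- matrix(rnorm(240 * 180), nrow = 240, ncol = 180)

# Size of the block that will be overwritten
block <- randomImageMatrix[100:140, 70:110]
dim(block)    # 41 41

# Overwrite the block with the largest value in the matrix
randomImageMatrix[100:140, 70:110] <- max(randomImageMatrix)
all(randomImageMatrix[100:140, 70:110] == max(randomImageMatrix))   # TRUE
```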

Figure 7.6: The greyscale translation of the PPM image, a histogram of the grey values, and the image resulting from reducing all the grey values in the image by half.


Figure 7.7: A random image.

Figure 7.8: A random image with a square doughnut hole in the middle.

7.6 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• cat() This function dumps the passed arguments out to the terminal.

• grey(x) This function returns the grey color associated with the value of x. It is assumed that 0 ≤ x ≤ 1.

• image(x) Can be used to create an image as either grey or colors for the values in the matrix x.

• max(x) Returns the maximum value contained in x.

• rnorm(x) Returns x random numbers from a normal distribution, N(µ, σ).


7.7 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Create a Portable Bitmap Format file (*.pbm) exactly like the one that is shown for the letter R but make it represent the letter L.

2. Why is Figure 7.2 not right-side-up?

3. Make your L image correct by changing the values of the underlying matrix such that when it is plotted using the image command it is in the correct orientation.

4. What is the purpose of the PX number on the first line of the PNM file formats?

5. Load your own copy of the image Libbie.ppm into R using the read.pnm function as demonstrated in the Chapter. Create three copies of the image and for each copy remove the values in one channel (e.g., make one of the color matrices a zero matrix). Plot these images in a three-paned graphic using the par(mfrow=c(1,3)) option.

6. Replot the randomImageMatrix using a color palette instead of the grey palette shown. (Hint: See ?rainbow for five of the stock palettes available to you.)

7. What is the default palette used in the image plot function?

8. What is the purpose of the optional argument bbox in the pixmapGrey function?

9. Create the greyscale version of the image shown in the leftmost box in Figure 7.6. The grey channel is composed of greyscale values that must be in the interval [0, 1]. Can you invert the colors in this image? (Hint: If you can't figure out how to do this, see the footnote at the end of this sentence but only as a last resort.¹)

10. Why do you have to use the @ notation to access components of the pixmaps in thischapter?

¹ Are you sure you want a hint? Take 1 minus the grey channel to make the values flipped in the [0, 1] interval.


Chapter 8

Matrix Analysis

Matrices are used in a wide variety of biological studies. In this Chapter I will use the example of stage-classified matrix models to introduce you to how matrix manipulation operates in R. There are some issues with respect to basic operations on matrices that you may not fully appreciate if you haven't had a course on Matrix Algebra, so these need to be addressed first.

In this chapter, you will focus on the following topics:

• Understand matrix operations in R.

• Create stage-classified matrix models.

8.1 Matrices In R

As shown in 2.4.9, a matrix is a fully recognized data type in R. In fact, R does a wonderful job of working with matrices and is much faster at doing vector and matrix operations directly than looping through matrices of values using a for()-loop (see 11.1 for a complete discussion of looping in R).

In specific terms for this Chapter, a matrix can be defined as a 2-dimensional object that holds numeric values. Matrices can be created by hand using the matrix() function and the elements within them can be accessed using the square bracket notation (e.g., X[i,j]) as:

> X <- matrix(0,nrow=4,ncol=4)
> X[1,2] <- 23
> X[1,4] <- 42
> X
     [,1] [,2] [,3] [,4]
[1,]    0   23    0   42
[2,]    0    0    0    0
[3,]    0    0    0    0
[4,]    0    0    0    0

You can also wrap the as.matrix() function around the read.table() function and read the data from a matrix in a file into a variable directly. For a review of these two functions see 2.4.9 and 3.1.2. In the online data sets for this chapter, there is a file called ExampleMatrix.csv that was exported from a spreadsheet. It can be read in as follows:

> A <- as.matrix(read.table("ExampleMatrix.csv", header=F, sep="\t"))
> A
           V1      V2      V3      V4      V5      V6      V7       V8       V9
 [1,] 0.00000 2.00000 2.00000 5.00000 4.00000 2.00000 7.00000 2.603310 2.000000
 [2,] 2.00000 0.00000 4.00000 6.00000 3.00000 4.00000 7.00000 3.603310 4.000000
 [3,] 2.00000 4.00000 0.00000 6.00000 4.00000 3.00000 7.00000 1.603310 1.000000
 [4,] 5.00000 6.00000 6.00000 0.00000 3.00000 1.00000 1.00000 3.694210 6.000000
 [5,] 4.00000 3.00000 4.00000 3.00000 0.00000 3.00000 4.00000 1.966940 4.000000
 [6,] 2.00000 4.00000 3.00000 1.00000 3.00000 0.00000 2.00000 2.148760 3.000000
 [7,] 7.00000 7.00000 7.00000 1.00000 4.00000 2.00000 0.00000 4.694210 7.000000
 [8,] 2.60331 3.60331 1.60331 3.69421 1.96694 2.14876 4.69421 0.000000 0.603306
 [9,] 2.00000 4.00000 1.00000 6.00000 4.00000 3.00000 7.00000 0.603306 0.000000
[10,] 4.00000 5.00000 4.00000 4.00000 4.00000 2.00000 3.00000 3.421490 4.000000
[11,] 3.00000 5.00000 3.00000 5.00000 6.00000 2.00000 4.00000 3.603310 3.000000
[12,] 3.00000 4.00000 3.00000 5.00000 3.00000 3.00000 6.00000 1.421490 2.000000
          V10     V11     V12
 [1,] 4.00000 3.00000 3.00000
 [2,] 5.00000 5.00000 4.00000
 [3,] 4.00000 3.00000 3.00000
 [4,] 4.00000 5.00000 5.00000
 [5,] 4.00000 6.00000 3.00000
 [6,] 2.00000 2.00000 3.00000
 [7,] 3.00000 4.00000 6.00000
 [8,] 3.42149 3.60331 1.42149
 [9,] 4.00000 3.00000 2.00000
[10,] 0.00000 1.00000 3.00000
[11,] 1.00000 0.00000 4.00000
[12,] 3.00000 4.00000 0.00000

There are a few things to notice here:

1. R wraps values for matrices so that only a portion of each row can be viewed at a time.

2. The columns of data that were read in from the file did not have a header row so R assigned them the names V1 - V12. This is the default behavior.

3. If there is one value in the matrix that has a decimal portion to it, all the values in that column will be displayed with the same number of decimal places (e.g., compare the matrices X and A from the two listings).

8.1.1 Matrix Arithmetic

Matrices have their own special kind of arithmetic that you may not be aware of, so here is a very short course. For the following examples, I will be using the matrices X,¹ Y, and Z as defined by the R commands:

> X <- matrix(1:9,nrow=3,byrow=TRUE)
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> Y <- matrix(9:1,nrow=3)
> Y
     [,1] [,2] [,3]
[1,]    9    6    3
[2,]    8    5    2
[3,]    7    4    1
> Z <- matrix(1:12,nrow=4)
> Z
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

¹ For matrices I will use upper case bold letters for variable names in the text to make it easier to distinguish them from non-matrix variables as you read along. Obviously, this is not possible in R itself but for the text hopefully this will make it easier to follow.

One of the main things you have to pay attention to when dealing with matrices is the number of rows and columns in the matrices. In these example matrices, X and Y are square matrices (i.e., they have the same number of rows and columns) whereas Z is not square as it has 4 rows and 3 columns of data. To access the number of rows and columns in a matrix you can use the function dim().
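As a brief sketch, dim() returns the number of rows followed by the number of columns, and nrow() and ncol() return each piece separately. The helper isSquare() is a hypothetical convenience function written only for this example.

```r
X <- matrix(1:9, nrow = 3, byrow = TRUE)
Y <- matrix(9:1, nrow = 3)
Z <- matrix(1:12, nrow = 4)

dim(X)    # 3 3
dim(Z)    # 4 3
nrow(Z)   # 4
ncol(Z)   # 3

# Hypothetical helper: TRUE when a matrix has as many rows as columns
isSquare <- function(m) nrow(m) == ncol(m)
isSquare(X)   # TRUE
isSquare(Z)   # FALSE
```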

Scalar Addition & Subtraction

Matrices may be shifted by the addition or subtraction of a constant scalar value (e.g., 2 + X). Scalar addition and subtraction take the value of the scalar and add it to (or subtract it from) every element in the matrix.

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> X + 3
     [,1] [,2] [,3]
[1,]    4    5    6
[2,]    7    8    9
[3,]   10   11   12

Matrix Addition & Subtraction

For both addition and subtraction of matrices, the numbers of rows and columns must be identical. If they are, the addition and/or subtraction operation results in the element-wise addition (or subtraction) of the two matrices. In R you can use the normal addition (+) and subtraction (-) operators as demonstrated below.

> X+Y
     [,1] [,2] [,3]
[1,]   10    8    6
[2,]   12   10    8
[3,]   14   12   10

But when they are not the same size, R will barf up an error message telling you they are not amenable to this operation.


> X+Z
Error in X + Z : non-conformable arrays
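If you would rather test for this situation than run into the error, one possible approach (a sketch, not the only way to do it) is to compare the dimensions first, or to catch the error with tryCatch():

```r
X <- matrix(1:9, nrow = 3, byrow = TRUE)
Z <- matrix(1:12, nrow = 4)

# Addition is only defined when the dimensions match exactly
sameSize <- identical(dim(X), dim(Z))
sameSize   # FALSE

# Catch the error and keep its message instead of stopping
result <- tryCatch(X + Z, error = function(e) conditionMessage(e))
result     # the "non-conformable arrays" message shown above
```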

Scalar Multiplication

The values within a matrix may be scaled by the multiplication of a scalar value (e.g., 0.5 * X). Scalar multiplication results in every single element in the matrix being multiplied by the scalar value. For example:

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> X * 2
     [,1] [,2] [,3]
[1,]    2    4    6
[2,]    8   10   12
[3,]   14   16   18

Element-wise Multiplication

It is possible to multiply two matrices where what you want is a new matrix that is the element-wise product of the original matrices. This is sometimes called the Hadamard product or the Schur product. In R this operation is conducted using the regular multiplication character, *, between the two matrices. The result of this operation is a new matrix with the same dimensions as the two original ones.

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> Y
     [,1] [,2] [,3]
[1,]    9    6    3
[2,]    8    5    2
[3,]    7    4    1
> X * Y
     [,1] [,2] [,3]
[1,]    9   12    9
[2,]   32   25   12
[3,]   49   32    9

Multiplication

Matrix multiplication is slightly more complicated than multiplication among scalars or multiplying a scalar by a matrix. For example, in matrix multiplication, AB ≠ BA in general. This is because of the way that matrices are multiplied. Moreover, there are several restrictions on which sets of matrices can be multiplied together.


For example, consider the operation A = XY where the matrix X has $r_X$ rows and $c_X$ columns of data and the matrix Y has $r_Y$ rows and $c_Y$ columns of data. For this operation to be defined, the number of columns in X, $c_X$, must equal the number of rows in Y (i.e., $c_X = r_Y$). If these are not equal, then you cannot perform the multiplication. Moreover, the resulting matrix A will have $r_X$ rows and $c_Y$ columns. This is because the matrix multiplication is conducted as:

$$A_{ij} = \sum_{k=1}^{N} X_{ik} Y_{kj}$$

where $N = c_X = r_Y$.

Essentially, each row of X is multiplied, element by element, against each column of Y and the products are summed.

In R matrix multiplication uses a unique operator that you probably haven't seen yet. To indicate that you want two matrices to be multiplied (and not the Hadamard product as above) you use the compound operator %*%. That is right, it is a pair of percent signs surrounding the normal multiplication character (a.k.a. the asterisk). Two examples using the matrices X and Y are given below. Notice how XY ≠ YX.

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> Y
     [,1] [,2] [,3]
[1,]    9    6    3
[2,]    8    5    2
[3,]    7    4    1
> X %*% Y
     [,1] [,2] [,3]
[1,]   46   28   10
[2,]  118   73   28
[3,]  190  118   46
> Y %*% X
     [,1] [,2] [,3]
[1,]   54   72   90
[2,]   42   57   72
[3,]   30   42   54
> X %*% I
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> X - (X %*% I)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
[3,]    0    0    0
> I %*% X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Here both X and Y are square and have the same number of rows and columns (i.e., the simplest case because we don't have to make sure the correct rows and columns match). The identity matrix, I (defined in 8.1.2), is shown here with its groovy properties. Matrix multiplication by the identity matrix is commutative and will result in the original matrix, a kind of matrix version of multiplying a scalar by one.²

Here is an example using the matrices X and Z, which have different dimensions.

> Z
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> Z %*% X
     [,1] [,2] [,3]
[1,]   84   99  114
[2,]   96  114  132
[3,]  108  129  150
[4,]  120  144  168
> X %*% Z
Error in X %*% Z : non-conformable arguments

In the first case, Z %*% X is defined and provides a result because the number of columns in Z matches the number of rows in X. The reverse of this multiplication, X %*% Z, is undefined and R tells you so.
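To make the summation formula and the row/column restriction concrete, here is a sketch that multiplies two matrices with explicit loops and compares the result to %*%. The function matMult() is a hypothetical helper written for illustration only; in practice you would always use %*%, which is far faster.

```r
# Hypothetical helper: multiply two matrices using the summation formula
matMult <- function(X, Y) {
  # the number of columns in X must equal the number of rows in Y
  if (ncol(X) != nrow(Y)) stop("non-conformable arguments")
  A <- matrix(0, nrow = nrow(X), ncol = ncol(Y))
  for (i in seq_len(nrow(X))) {
    for (j in seq_len(ncol(Y))) {
      # sum over k of X[i,k] * Y[k,j]
      A[i, j] <- sum(X[i, ] * Y[, j])
    }
  }
  A
}

X <- matrix(1:9, nrow = 3, byrow = TRUE)
Z <- matrix(1:12, nrow = 4)

dim(matMult(Z, X))                  # 4 3: rows from Z, columns from X
all(matMult(Z, X) == Z %*% X)       # TRUE
conformable <- ncol(X) == nrow(Z)   # FALSE: X has 3 columns but Z has 4 rows
```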

8.1.2 Matrix Operations

There are several other operations that can be conducted on matrices that you will probably run across as you begin playing with matrices. Here is a smattering of a few.

The Diagonal

It is often necessary to interact with the diagonal of a matrix, defined as the elements in the matrix whose row index is equal to the column index. For example, in a covariance matrix, the diagonal elements are the variance estimates. In R you can get access to the diagonal of a matrix by using the diag() function. Some examples using the diag() function include:

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> diag(X)
[1] 1 5 9
> Z
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> diag(Z)
[1]  1  6 11

² There are other matrices that have this property that are not as simple as this one and if you take some multivariate statistics, it will blow your mind how cool they are...


Notice how even for non-square matrices the diagonal is defined. You can also extract and insert particular values for the diagonal as demonstrated below:

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> origDiag <- diag(X)
> origDiag
[1] 1 5 9
> diag(X) <- c(42,23,4)
> X
     [,1] [,2] [,3]
[1,]   42    2    3
[2,]    4   23    6
[3,]    7    8    4
> diag(X) <- origDiag
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

A commonly used matrix that can easily be constructed using the diag() function is the Identity Matrix, whose symbol is I. This matrix has zeros everywhere except on the diagonal, where it has ones:

> I <- matrix(0,nrow=3,ncol=3)
> diag(I) <- 1
> I
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

Finally, there is an operator called the trace of a matrix that is typically written as tr(A), which is the sum of the diagonal elements. If A is a variance-covariance matrix, as is commonly found in multivariate statistics, then its trace is the total variance. In R we can find the trace using a combination of the sum() and diag() functions as:

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> sum(diag(X))
[1] 15

Matrix Determinant

The determinant of a matrix is a single scalar value computed from the elements of a square matrix. The calculation of the determinant is somewhat complicated when we get to matrices that have more than two rows and columns and I'll let you go find a linear algebra book to look into it if you so desire. For a 2x2 matrix, the determinant, denoted as |A|, is given as:

$$|A| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21}$$

In R the function det() is used to calculate the determinant of a matrix.

> X <- matrix(c(1,6,3,4),nrow=2)
> X
     [,1] [,2]
[1,]    1    3
[2,]    6    4
> det(X)
[1] -14
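The 2x2 formula can be applied by hand to the same matrix as a quick check on det(); the name byHand is just an illustrative variable.

```r
X <- matrix(c(1, 6, 3, 4), nrow = 2)

# a11 * a22 - a12 * a21
byHand <- X[1, 1] * X[2, 2] - X[1, 2] * X[2, 1]
byHand                              # -14

# det() agrees, up to floating point rounding
isTRUE(all.equal(byHand, det(X)))   # TRUE
```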

Matrix Transpose

The transpose of a matrix is an operation that exchanges the row and column indices of the elements. This will change the dimensions of the matrix if it is not square. Notationally, you will see several different ways to represent a transpose such as $A'$ or $A^T$.

In R the transpose operation is performed with the t() function.

> Z
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> t(Z)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> t(t(Z))
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

Notice that the transpose of a transpose is equal to the original variable.

Matrix Inversion

For scalars, the inverse is defined as $x^{-1} = 1/x$ but for matrices it is slightly more complicated. There are even large groups of matrices that cannot be inverted. One property that prevents inversion is if the matrix is singular (think black hole of mathematics, or matrices that have a zero determinant).
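A short sketch of both situations, using base R's solve() function to do the inversion. The singular matrix S is a made-up example whose second column is simply twice its first.

```r
# An invertible matrix: multiplying by its inverse gives the identity matrix
A <- matrix(c(1, 6, 3, 4), nrow = 2)
Ainv <- solve(A)
round(Ainv %*% A, 10)

# A singular matrix: the determinant is zero and solve() refuses to invert it
S <- matrix(c(1, 2, 2, 4), nrow = 2)
det(S)                                                          # 0
outcome <- tryCatch(solve(S), error = function(e) "cannot be inverted")
outcome                                                         # "cannot be inverted"
```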

A common use for matrix inversion is in the estimation of regression coefficients by least squares. In 6.2, we used the lm() function to estimate the intercept and slope coefficients. This can also be done using matrix algebra and the inversion function ginv() found in the MASS library. A one-column matrix of regression coefficients B is estimated from the formula:


$$B = (X'X)^{-1}X'Y$$

where the Y matrix is the usual matrix of response variables and the X matrix has a first column of all ones (1) for the intercept and the remaining columns hold the predictor variables.

> X <- matrix(c(rep(1,10), 1:10), ncol=2)
> X
      [,1] [,2]
 [1,]    1    1
 [2,]    1    2
 [3,]    1    3
 [4,]    1    4
 [5,]    1    5
 [6,]    1    6
 [7,]    1    7
 [8,]    1    8
 [9,]    1    9
[10,]    1   10
> Y <- matrix(c(19,25,14,15,24,17,19,27,29,25))
> Y
      [,1]
 [1,]   19
 [2,]   25
 [3,]   14
 [4,]   15
 [5,]   24
 [6,]   17
 [7,]   19
 [8,]   27
 [9,]   29
[10,]   25
> library(MASS)
> ginv(t(X) %*% X) %*% (t(X) %*% Y)
           [,1]
[1,] 16.3333333
[2,]  0.9212121
> lm(Y ~ c(1:10))

Call:
lm(formula = Y ~ c(1:10))

Coefficients:
(Intercept)      c(1:10)
    16.3333       0.9212

You can see from the comparison that both lm() and the matrix multiplication/inversion method produce the same estimates for the intercept and the slope coefficient. If you were to center both variables (e.g., Z <- Y - mean(Y), and likewise subtract the mean from the predictor, so that each has a mean of zero), you could have the X matrix without the column for the intercept (β0 = 0) and you would get the same estimate for the slope coefficient, β1.
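Here is a sketch of the centering idea, assuming that both the response and the predictor have their means removed. It uses base R's solve() in place of ginv() so that no extra library is needed; the lower-case variable names are mine.

```r
x <- 1:10
y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)

# Full model: a column of ones for the intercept plus the predictor
X <- cbind(1, x)
B <- solve(t(X) %*% X) %*% t(X) %*% y
B[1]    # intercept, about 16.3333
B[2]    # slope, about 0.9212

# Centered model: no intercept column, both variables have mean zero
xc <- matrix(x - mean(x))
yc <- matrix(y - mean(y))
slopeCentered <- solve(t(xc) %*% xc) %*% t(xc) %*% yc
slopeCentered[1]   # the same slope as B[2]
```

Centering only the response and leaving the predictor alone would not reproduce the slope, which is why both are centered here.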

Eigen Decompositions

An eigenvalue/eigenvector decomposition is a "magical property" of matrices that can only be appreciated by some experience in matrix algebra. However, we will be using them in the next section so it seems there is a need to introduce them here. Start by considering the square (k x k) matrix A and the identity matrix (I) in the characteristic equation |A - λI| = 0.

Using the matrix:

> A <- matrix(c(1,6,3,4),nrow=2)
> A
     [,1] [,2]
[1,]    1    3
[2,]    6    4

The eigenvalues for the matrix are given by solving the characteristic formula:

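$$\begin{aligned}
0 &= |A - \lambda I| \\
  &= \left| \begin{bmatrix} 1 & 3 \\ 6 & 4 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right| \\
  &= \begin{vmatrix} 1-\lambda & 3 \\ 6 & 4-\lambda \end{vmatrix} \\
  &= (1-\lambda)(4-\lambda) - 18 \\
  &= \lambda^2 - 5\lambda - 14
\end{aligned} \tag{8.1}$$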

If we solve for λ we see that the possible values are 7 and -2. These are called the eigenvalues of the matrix A.
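If you would rather not factor the quadratic by hand, base R's polyroot() function finds the roots of a polynomial from its coefficients, given in increasing order of power. This is only a numerical check on the algebra above.

```r
# lambda^2 - 5*lambda - 14 = 0, coefficients ordered as constant, lambda, lambda^2
roots <- polyroot(c(-14, -5, 1))

# polyroot() returns complex numbers; the real parts are the eigenvalues
lambdas <- sort(Re(roots))
round(lambdas, 6)   # -2 7
```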

Each eigenvalue has an associated eigenvector such that:

Ax = λx

where x is a vector (i.e., a matrix with only one column) that is matched to each of the k eigenvalues. The equation above defines the right eigenvector; a left eigenvector also exists and has the form xA = xλ. From either of these, we need to solve for x. Starting with the largest eigenvalue, $\lambda_1 = 7$, we have:

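$$\begin{bmatrix} 1 & 3 \\ 6 & 4 \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} = \lambda_1 \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} \tag{8.2}$$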

If we multiply these out, we get the following equations:

$$\begin{aligned}
1e_1 + 3e_2 &= 7e_1 \\
6e_1 + 4e_2 &= 7e_2
\end{aligned}$$

And here we have two equations in two variables and can easily solve for the values of $e_1$ and $e_2$, and these values define the eigenvector $\vec{v}_1 = [e_1, e_2]$ that is linked to the eigenvalue $\lambda_1$. We can do the same for the second vector (which I will let you play with in those boring weekend hours where you are wishing that you had some really cool math problem to solve).


It is important to point out here that the values for $\vec{v}_1$ can be scaled. As you look at the equations above we can solve for the components and find that $e_1 = e_2/2$. There are a lot of values for $e_1$ and $e_2$ that make this statement true. And if we think about the vector $\vec{v}_1 = [e_1, e_2]$ as a projection away from the origin a distance of $e_1$ on one axis and $e_2$ on a second orthogonal axis it may make a bit more sense. There are several vectors that will point in a direction that will intersect the point $(e_1, e_2)$, all of which are the same except for a scaling factor. This is shown graphically in Figure 8.1 with two vectors pointing in the same direction but with different lengths.

Figure 8.1: Image depicting two vectors $v_{red} = [4, 2]$ and $v_{blue} = [2, 1]$ that are projecting in the same direction but have different magnitudes.

The reason I bring this up is that it is common for routines that calculate vectors, such as we are doing here for the eigenvector decomposition, to scale the vectors such that their lengths are set to some normalizing constant such as 1. As a result, if you solve for $\vec{v}_1$ and then check it below with the eigen() function you may not get the same values, but if you were to plot the vectors, the lines away from the origin would be pointing in the same direction.
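The scaling issue is easy to see numerically. In this sketch I pick one convenient solution by hand (any vector whose second element is twice its first will do) and then rescale it to unit length; the variable names are illustrative.

```r
A <- matrix(c(1, 6, 3, 4), nrow = 2)

# One solution of the two equations for lambda = 7
v <- c(1, 2)
A %*% v            # 7 and 14, which is 7 * v

# Rescale the vector so that its length is 1
unitV <- v / sqrt(sum(v^2))
round(unitV, 7)    # 0.4472136 0.8944272
```

A routine may also flip the sign of the whole vector, since -v points along the same line through the origin and satisfies the same equation.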

There are some interesting properties of eigenvalues and eigenvectors.

• If the original matrix is symmetric (actually non-negative semi-definite but who's watching), the original matrix can be written as $A = \sum_{i=1}^{k} \lambda_i e_i e_i'$. This is called the spectral decomposition of the matrix A.

• The product of the eigenvalues is equal to the determinant of the original matrix (i.e., $\prod_{i=1}^{k} \lambda_i = |A|$).

• The sum of the eigenvalues is equal to the trace of the matrix (i.e., $\sum_{i=1}^{k} n_i \lambda_i = tr(A)$ where $n_i$ is the multiplicity of the eigenvalue $\lambda_i$).

• If it is possible to invert A then the eigenvalues of $A^{-1}$ will be the inverse of the eigenvalues of A (i.e., they will be $\lambda_i^{-1}$).

• The eigenvectors of A and $A^{-1}$ are identical.

R has an eigen() function that takes a square matrix and returns the eigenvalues and eigenvectors as a list. Here is an example using our little friend the A matrix we touched on above.

> A
     [,1] [,2]
[1,]    1    3
[2,]    6    4
> rootsOfA <- eigen(A)
> rootsOfA
$values
[1]  7 -2

$vectors
           [,1]       [,2]
[1,] -0.4472136 -0.7071068
[2,] -0.8944272  0.7071068

Barring the possibility that I actually just copied and pasted the results from eigen() into the discussion above on $\vec{v}_1 = [e_1, e_2]$, the answer looks like it should.
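Several of the properties listed earlier can be checked numerically on the same A matrix. This is a sketch using only base R functions, with solve() providing the inverse.

```r
A <- matrix(c(1, 6, 3, 4), nrow = 2)
lambda <- eigen(A)$values   # 7 and -2

# The product of the eigenvalues equals the determinant
prod(lambda)       # -14
det(A)             # -14

# The sum of the eigenvalues equals the trace
sum(lambda)        # 5
sum(diag(A))       # 5

# The eigenvalues of the inverse are the inverses of the eigenvalues
invLambda <- eigen(solve(A))$values
sort(invLambda)
sort(1 / lambda)
```

The two sorted vectors at the end agree; sorting is needed because eigen() orders its results by decreasing absolute value, so the order changes after inversion.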

8.2 Stage-Classified Matrix Models

Stage-classified matrix models are concerned with understanding the processes that influence the persistence of populations. These models tacitly assume that the continuum of life histories for a species can be partitioned into discrete stages and that a census of individuals in a population can be performed wherein we can tally the number of individuals in each of these discrete stages. Some species lend themselves to stage-classification better than others, and the distinctions on how to go about defining stages are best left to another course. Here we are going to introduce the notation of a matrix model in R and then perform some analyses on these models. This Chapter is intended to only whet your appetite a bit on matrix models and for those that are interested, you should seek out another course or at least read a good text such as Caswell (2001).

8.2.1 Transition Matrices & Census Vectors

For the sake of discussion, let's assume that we are working with a plant, Grenus growii, that has the following four distinct life stages. Moreover, from our vast knowledge of this organism, we have the accompanying information about the way this species proceeds through life stages.

Seed The seed stage lasts a single time step (e.g., there is no persistent seed bank) and only 50% of the seeds actually germinate; the others are either eaten or rot.

Seedling The seedling stage is a non-reproductive stage and herbivory removes 20% of the individuals that get into this stage and the remaining individuals become juveniles.

Juvenile The juvenile stage is the first reproductive stage and on average each juvenile produces 1.3 offspring. Depending upon the habitat the juvenile is located in, half move on to the next stage and a quarter stay as juveniles. The remaining ones are eaten.

Adult The final adult stage is where most of the reproduction happens with each individual producing an average of 3.1 offspring. Half of the adults persist in the adult stage from one time step to the next.

A diagram of this fictitious species is shown in Figure 8.2.

Figure 8.2: A graphical depiction of the life history stages in the fictitious plant Grenus growii.

Here each of the spheres in this image represents a stage. The arrows between the stages depict either fertility estimates (labeled $f_X$) when they point back to the seed stage, or transitions (labeled $p_{XY}$, signifying the probability that an individual proceeds to stage X from stage Y). From the description we have above, we can associate values with the particular parameters in this life history diagram. In Table 8.1 I show the values for each of the parameters listed.

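These parameters can now be put into a transition matrix,³ A, that has a particularly strict form.

$$A = \begin{bmatrix}
f_1 & f_2 & f_3 & f_4 \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix} \tag{8.3}$$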

³ Actually this is not a transition matrix as it does not sum to 1; rather it is a Leslie matrix, but I think I can get away with generalizing the term a bit here.


Table 8.1: Table of life history values separated into (A) fertility estimates (the $f_X$ items) and (B) transition probabilities depicting the movement between stages and within stages.

A. Fertility Estimates

Stage       Parameter   Value
Seed        f1          0
Seedling    f2          0
Juvenile    f3          1.3
Adult       f4          3.1

B. Transition Probabilities

Transition              Parameter   Value
Seed → Seedling         p21         0.5
Seedling → Juvenile     p32         0.8
Juvenile → Adult        p43         0.5
Juvenile → Juvenile     p33         0.25
Adult → Adult           p44         0.5

The items in the matrix are partitioned into two components: the top row records the fecundity values, $f_X$, and the second and remaining rows depict the probabilities of transition, $p_{XY}$. Inserting the observed values into this matrix gives us:

\[
A = \begin{bmatrix}
0 & 0 & 1.3 & 3.1 \\
0.5 & 0 & 0 & 0 \\
0 & 0.8 & 0.25 & 0 \\
0 & 0 & 0.5 & 0.5
\end{bmatrix}
\tag{8.4}
\]

In R we can create this matrix using the following code:

> A <- matrix(0, nrow=4, ncol=4)
> A[1,3] <- 1.3
> A[1,4] <- 3.1
> A[2,1] <- 0.5
> A[3,2] <- 0.8
> A[3,3] <- 0.25
> A[4,3] <- 0.5
> A[4,4] <- 0.5
> A
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.0 1.30  3.1
[2,]  0.5  0.0 0.00  0.0
[3,]  0.0  0.8 0.25  0.0
[4,]  0.0  0.0 0.50  0.5
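The same matrix can also be entered in a single call instead of cell by cell. The following is a minimal sketch, assuming you would rather type the values in the row-by-row layout of Equation 8.4; the stage labels are my own addition for readability and are not required by anything that follows.

```r
# Build the 4 x 4 matrix in one call, filling it row by row (byrow = TRUE)
# so the typed layout matches Equation 8.4.
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, ncol = 4, byrow = TRUE)

# Optional: label rows and columns by stage (labels are an assumption of
# this sketch, not part of the original example).
stages <- c("Seed", "Seedling", "Juvenile", "Adult")
dimnames(A) <- list(stages, stages)
A
```

With the labels in place, an entry can be read by name, for example A["Adult", "Juvenile"] for the juvenile-to-adult transition.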

The entries in this matrix have some rather special properties if we put the values into it as directed.


Intrinsic Growth Rate

The Euler-Lotka integral equation for the instantaneous growth rate, r, is well known to most biologists (...) and has the form:

\[
1 = \int_0^{\infty} l(x)\, m(x)\, e^{-rx}\, dx
\]

where the term l(x) is the fraction of individuals surviving to x, m(x) is the fertility rate of individuals at x, and r is the growth rate. The growth rate is the part that we are interested in. Expressed as a per-time-step multiplier, $\lambda = e^{r}$ (which is how the growth rate is used in the rest of this chapter), it tells us:

\[
\lambda
\begin{cases}
< 1 & \text{population size decaying exponentially} \\
= 1 & \text{stable size through time} \\
> 1 & \text{population size increasing exponentially}
\end{cases}
\]

We can estimate this growth rate using an eigenvalue decomposition of the transition matrix A. Due to the way the matrix is set up, the largest real eigenvalue of the matrix ($\lambda_1$ as defined in 8.1.2) is this multiplier. So, once the matrix A is entered into R, we can find the growth parameter as:

> A
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.0 1.30  3.1
[2,]  0.5  0.0 0.00  0.0
[3,]  0.0  0.8 0.25  0.0
[4,]  0.0  0.0 0.50  0.5
> eigen(A)
$values
[1]  1.2075472+0.0000000i -0.0067844+0.8194141i -0.0067844-0.8194141i
[4] -0.4439783+0.0000000i

$vectors
             [,1]                  [,2]                  [,3]          [,4]
[1,] 0.8603823+0i  0.7490103+0.0000000i  0.7490103+0.0000000i -0.4753001+0i
[2,] 0.3562521+0i -0.0037839-0.4570089i -0.0037839+0.4570089i  0.5352740+0i
[3,] 0.2976372+0i -0.4052283+0.1306829i -0.4052283-0.1306829i -0.6170499+0i
[4,] 0.2103303+0i  0.1682952+0.1431813i  0.1682952-0.1431813i  0.3268348+0i

Here we can see that $\lambda_1$ is a real number (its imaginary part, +0.0000000i, is zero) even though there are some complex eigenvalues (roots) of this matrix. Moreover, it suggests that the overall behavior of this transition matrix is to increase overall population size, multiplying it by roughly 1.2 each time step.
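Picking the growth rate out of the eigen() output by eye works for one matrix, but it can also be done in code. The following is a minimal sketch, assuming "largest non-imaginary eigenvalue" means the largest eigenvalue whose imaginary part is zero up to rounding; the tolerance of 1e-10 and the variable names are my own choices.

```r
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE)

ev <- eigen(A)$values

# Keep only the eigenvalues whose imaginary part is numerically zero,
# then take the largest of their real parts.
isReal  <- abs(Im(ev)) < 1e-10
lambda1 <- max(Re(ev[isReal]))
lambda1   # approximately 1.2075, matching the first value printed above
```

Re() strips the "+0i" part so that the result is an ordinary numeric value that can be used in later arithmetic.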

The particular value of λ determines the overall long-term behavior of the population. Essentially, as time increases (t: 0 → ∞), the impact of λ is determined by raising it to higher and higher powers. Figure 8.3 shows the projected impact on population growth through time for two values of λ, $\lambda_{red} = 0.8$ and $\lambda_{blue} = 1.2$.
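To make "raising it to higher and higher powers" concrete, here is a small sketch that tabulates λ^t for the two values used in Figure 8.3. It only assumes the values 0.8 and 1.2 and time steps 0 through 10; it is not the code used to draw the figure.

```r
t <- 0:10

decay  <- 0.8^t   # lambda < 1: shrinks toward zero
growth <- 1.2^t   # lambda > 1: grows without bound

# Show the multipliers at t = 0, 5 and 10 (columns 1, 6 and 11).
round(rbind(t, decay, growth)[, c(1, 6, 11)], 3)
```

After ten steps the decaying population is at roughly a tenth of its starting size while the growing one is roughly six times larger.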


Figure 8.3: Effects of the instantaneous growth rate λ as a function of time for both exponential growth ($\lambda_{blue} = 1.2$) and exponential decay ($\lambda_{red} = 0.8$).

Stable Stage Distribution

The values in A also contain information on the relative proportion of individuals that will be in each stage class as the population stabilizes into a steady state (either growing, stable, or declining). This information is contained in the eigenvector that is associated with $\lambda_1$. From the output above we see that:

> ssd <- as.numeric(eigen(A)$vectors[,1])
> ssd
[1] 0.8603823 0.3562521 0.2976372 0.2103303
> sum(ssd)
[1] 1.724602
> ssd <- ssd / sum(ssd)
> ssd
[1] 0.4988875 0.2065706 0.1725831 0.1219587
> sum(ssd)
[1] 1

Here you see that the eigenvectors are scaled to unit length (i.e., t(e_i) %*% e_i = 1), as mentioned above, which results in a total sum of the vector of sum(ssd) = 1.724602. If we are interested in finding the proportion of the population that is in each stage, then we need to standardize the vector so that sum(ssd) = 1, and this is done by dividing every element by the total. As a result, ssd suggests that at equilibrium there should be 50% of the individuals as seeds, 21% as seedlings, 17% as juveniles, and 12% as adults.
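If you need the stable stage distribution for more than one matrix, the steps above can be wrapped in a small function. This is a sketch only: the function name stableStage is hypothetical, functions are covered properly elsewhere in the text, and the code assumes the matrix has at least one real eigenvalue.

```r
# Hypothetical helper: returns the eigenvector of the largest real
# eigenvalue, rescaled so that its elements sum to 1.
stableStage <- function(A) {
  e        <- eigen(A)
  realOnes <- which(abs(Im(e$values)) < 1e-10)            # real eigenvalues
  k        <- realOnes[which.max(Re(e$values[realOnes]))]  # the largest one
  v        <- Re(e$vectors[, k])
  v / sum(v)   # dividing by the sum also removes any arbitrary sign
}

A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE)

ssd <- stableStage(A)
round(ssd, 4)
```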

We will return to these numbers and the estimate for r in the next subsection when we iterate the data manually.

Bar Plots

In the previous example, we determined the stable stage distribution to estimate the proportion of the total population that is in each group. Graphically, this material could be depicted as a bar graph, and since we haven’t covered how to make bar graphs yet, this is as good a time as any...

There is an option in the normal plot() function, type="h", that will kind of plot bars of your data to a figure. Actually, these are high-density lines and not real bar plots. This is what I used to make Figure 4.2 and at that time it got the job done, but a true bar plot looks a bit different than those lines.

R provides the function barplot() that takes a vector of heights and produces a general bar plot for you. Without modifications, the function barplot() does not produce a very interesting plot in my opinion. However, there are several optional arguments that can be used to create a more informative graphic. They include:

• names.arg a vector of names that you can have placed on the x-axis below the bars.

• width controls the width of the bars.

• space controls the amount of area between the bars, with a value of zero having the bars touch and positive numbers equal to that number of bar widths (e.g., space=2 plots a bar and then 2 bar widths before the next bar shows up).

• horiz is a logical flag that will plot the bars horizontally instead of vertically.

• col can be passed as a single color or a vector of colors which are used to color the bars.

• ylim can adjust the limits of the y-axis as in normal plotting routines.

• xlab & ylab labels for the x- and y-axes.

Using the data from $\lambda_1$ in the previous section, we can plot the data as shown in Figure 8.4.

> ssd
[1] 0.4988875 0.2065706 0.1725831 0.1219587
> barplot(ssd)
> barplot(ssd, ylim=c(0,1), xlab="Stage", ylab="Proportion of Individuals",
+   names.arg=c("Seed","Seedling","Juvenile","Adult"), col=c("red","blue","green","yellow"))

The barplot() function can also be used to create stacked graphs (Figure 8.5).

To create this example, I used the following code:


Figure 8.4: Examples of two different calls to the plotting function barplot(). The parameters used to create these plots are given in the R code.

> x <- matrix(runif(9), nrow=3)
> x
          [,1]        [,2]      [,3]
[1,] 0.2355922 0.396869276 0.5674993
[2,] 0.7247734 0.001881527 0.9215767
[3,] 0.4625868 0.767329832 0.6408461
> barplot(x, names.arg=c("Control","A","B"), xlab="Treatments", ylab="Value",
+   legend=c("Category A","Category B","Category C"))

These stacked plots treat every column of data as a single bar, and the order in which the rows are presented is the order in which the stacking occurs. You can standardize the plot so that all bars have the same height by dividing each column by that column’s sum, which gives a proportional bar plot.
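The column standardization just described can be sketched as follows. This is one of several ways to do it and is not taken from the original text: sweep() and colSums() are base R functions, the seed value is arbitrary, and legend.text is the full name of the barplot() argument abbreviated as legend above.

```r
set.seed(42)                      # arbitrary seed so the example is repeatable
x <- matrix(runif(9), nrow = 3)

# Divide every column by its own sum; the 2 tells sweep() to work by column.
xProp <- sweep(x, 2, colSums(x), "/")
colSums(xProp)                    # every column now sums to 1

barplot(xProp, names.arg = c("Control", "A", "B"),
        xlab = "Treatments", ylab = "Proportion",
        legend.text = c("Category A", "Category B", "Category C"))
```

Because every bar now has a total height of 1, differences between treatments show up as differences in the relative size of the stacked segments rather than in overall bar height.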


Figure 8.5: Example of a stacked bar plot with multiple categories represented in each Treatment.

8.2.2 Projecting Stage Sizes

In the matrix model we have been playing with, the census count of individuals in each of the four stages can be represented by the vector $\vec{n}$, and in R as a matrix whose dimensions are 4 × 1. Assuming that I start with 12 seeds, 34 seedlings, 21 juveniles, and 12 adults, the vector can be created as:

> n <- matrix(c(12,34,21,12))
> n
     [,1]
[1,]   12
[2,]   34
[3,]   21
[4,]   12

Using this notation, we can predict what the number of individuals in the next time slice will be, given A and n, as:

\[
n_{t+1} = A n_t
\]


> A
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.0 1.30  3.1
[2,]  0.5  0.0 0.00  0.0
[3,]  0.0  0.8 0.25  0.0
[4,]  0.0  0.0 0.50  0.5
> n
     [,1]
[1,]   12
[2,]   34
[3,]   21
[4,]   12
> A %*% n
      [,1]
[1,] 64.50
[2,]  6.00
[3,] 32.45
[4,] 16.50

So after one generation, we can see that the number of seeds, juveniles, and adults all increased but the number of seedlings decreased. If we look at the next time step, we see that:

\[
\begin{aligned}
n_{t+2} &= A n_{t+1} \\
        &= A A n_t \\
        &= A^2 n_t
\end{aligned}
\]

And in general the vector of stage sizes at any arbitrary time step can be written as:

\[
n_t = A^t n_0 \tag{8.5}
\]

Let’s make a matrix to hold the n values for 11 time points in R and calculate the number of individuals in each stage for each time step. I use 11 here because the matrix starts counting at column 1, which will correspond to our time t = 0, so when t = 10 the column will be 11. Let’s also set the first column (our t = 0) equal to the census population size we were using above.

> N <- matrix(0, nrow=4, ncol=11)
> N
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    0    0    0    0    0    0    0    0    0     0     0
[2,]    0    0    0    0    0    0    0    0    0     0     0
[3,]    0    0    0    0    0    0    0    0    0     0     0
[4,]    0    0    0    0    0    0    0    0    0     0     0
> N[,1] <- n
> N
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]   12    0    0    0    0    0    0    0    0     0     0
[2,]   34    0    0    0    0    0    0    0    0     0     0
[3,]   21    0    0    0    0    0    0    0    0     0     0
[4,]   12    0    0    0    0    0    0    0    0     0     0

Now, for time steps 1 → 10 (columns 2 → 11 in the matrix N), we will use Equation 8.5 to calculate the number of individuals in each group.


> t <- 1
> N[,(t+1)] <- A %*% N[,t]
> t <- t + 1
> N
     [,1]  [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]   12 64.50    0    0    0    0    0    0    0     0     0
[2,]   34  6.00    0    0    0    0    0    0    0     0     0
[3,]   21 32.45    0    0    0    0    0    0    0     0     0
[4,]   12 16.50    0    0    0    0    0    0    0     0     0
> t
[1] 2

OK, here I am going to do something that saves some typing (you can use the up cursor key to repeat the last entry you typed in the R interpreter, and I will use this to make my life a bit easier). I have defined the variable t so that it indicates which column of the matrix to fill (the (t+1) part); it also tracks the power of A that has been applied so far, since column t+1 holds $A^t n_0$. Then I will increment the variable t by one and do it again and again until I’ve filled up the columns of N.

In the following code examples, I show that you can use a semicolon (;) to put more than one command on a line. Again, I combine the assignment of counts to the appropriate column of N and then update the counter variable t each time through until all eleven columns are full. In Chapter 11 you will learn how to use a loop to do this much more easily, but until then using the up cursor key in the R interpreter is good enough.

> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N
     [,1]  [,2]    [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]   12 64.50 93.3350    0    0    0    0    0    0     0     0
[2,]   34  6.00 32.2500    0    0    0    0    0    0     0     0
[3,]   21 32.45 12.9125    0    0    0    0    0    0     0     0
[4,]   12 16.50 24.4750    0    0    0    0    0    0     0     0
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N
     [,1]  [,2]    [,3]     [,4]     [,5]      [,6]      [,7]      [,8]
[1,]   12 64.50 93.3350 92.65875 95.68719 131.93725 168.77519 193.20372
[2,]   34  6.00 32.2500 46.66750 46.32937  47.84359  65.96862  84.38759
[3,]   21 32.45 12.9125 29.02813 44.59103  48.21126  50.32769  65.35682
[4,]   12 16.50 24.4750 18.69375 23.86094  34.22598  41.21862  45.77316
          [,9]     [,10]     [,11]
[1,] 226.86065 281.25553 343.80907
[2,]  96.60186 113.43032 140.62776
[3,]  83.84928  98.24381 115.30521
[4,]  55.56499  69.70713  83.97547
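As a preview of what a loop buys you, the repeated lines above can be collapsed into a few lines. Loops are not covered until Chapter 11, so treat this as a sketch rather than something you are expected to write yet; it reproduces the same matrix N under the same starting census.

```r
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE)

N <- matrix(0, nrow = 4, ncol = 11)
N[, 1] <- c(12, 34, 21, 12)        # census at t = 0

# Fill column t + 1 from column t, exactly as done by hand above.
for (t in 1:10) {
  N[, t + 1] <- A %*% N[, t]
}

round(N[, 11], 5)                  # stage sizes at t = 10
```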

That is a large number of values, so let’s plot them to see what the stages do as we go through 10 time steps. The code used to produce the image in Figure 8.6 is:

> # ylim is not defined in the original listing; a shared range like this one is assumed
> ylim <- c(0, max(N))
> plot(1:11, N[1,], xlab="", ylab="", axes=F, bty="n", col="red", ylim=ylim, type="l", lwd=2)
> par(new=T)
> plot(1:11, N[2,], xlab="", ylab="", axes=F, bty="n", col="blue", ylim=ylim, type="l", lwd=2)
> par(new=T)
> plot(1:11, N[3,], xlab="", ylab="", axes=F, bty="n", col="green", ylim=ylim, type="l", lwd=2)
> par(new=T)
> plot(1:11, N[4,], xlab="t", ylab="Number of Individuals", axes=T, bty="n", col="pink",
+   ylim=ylim, type="l", lwd=2)
> legend(2, 350, c("Seed","Seedling","Juvenile","Adult"), col=c("red","blue","green","pink"),
+   lwd=2, bty="n")

I use par(new=T) to overlay the lines on a single graph (see 4.1.1 for more on this). I also turn off the labels and axes for the first three plots because if you plot them over and over again, they look too dark on the graphic (think of printing the same line on top of itself numerous times). On the last one, I set the labels for the axes and then turn on the axes. Also included is the code I used to add the legend to the image. See ?legend for a complete discussion of the options that you can provide to this function.
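An alternative to overlaying four separate plots is the base R function matplot(), which draws one line per column of a matrix on shared axes. The sketch below is not how the figure in the text was produced; it assumes the same A and starting census, rebuilds N so that it is self-contained, and places the legend with the "topleft" keyword rather than with coordinates.

```r
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE)
N <- matrix(0, nrow = 4, ncol = 11)
N[, 1] <- c(12, 34, 21, 12)
for (t in 1:10) N[, t + 1] <- A %*% N[, t]

stageColors <- c("red", "blue", "green", "pink")

# matplot() plots the columns of its second argument, so transpose N
# to get one column (and therefore one line) per stage.
matplot(1:11, t(N), type = "l", lty = 1, lwd = 2, col = stageColors,
        xlab = "t", ylab = "Number of Individuals", bty = "n")
legend("topleft", c("Seed", "Seedling", "Juvenile", "Adult"),
       col = stageColors, lwd = 2, bty = "n")
```

Because all four lines are drawn in one call, the y-axis range is chosen automatically and the axes are drawn only once.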

Figure 8.6: Size of the four stage classes through time.

We can check some of the values that we estimated directly from A using the eigen decomposition by looking at the numbers in the matrix N. First, the growth rate we estimated from the first eigenvalue, $\lambda_1 \approx 1.2$, looks pretty close to that estimated from the raw counts.

> eigen(A)$values[1]
[1] 1.207547+0i
> sum(N[,11]) / sum(N[,10])
[1] 1.215202

And the proportion of individuals in each class, which was estimated by standardizing the first eigenvector, $v_1 / \sum_{i=1}^{4} v_{1i}$, is pretty close to what we see in N (and I throw in the first census so that you don’t think I put values in there that were already pretty close).

> N[,1] / sum(N[,1])
[1] 0.1518987 0.4303797 0.2658228 0.1518987
> N[,11] / sum(N[,11])
[1] 0.5028525 0.2056811 0.1686445 0.1228219
> ssd
[1] 0.4988875 0.2065706 0.1725831 0.1219587

If we were to iterate this a bit longer, you would see that the "brute force" estimates of the population growth rate and the stable stage distribution converge towards what was estimated from the eigen decomposition. In fact, Figure 8.7 shows the mean absolute deviation (MAD) representing the differences between the distribution of individuals in each stage and the predicted stable stage distribution (ssd) we calculated earlier. As you can see, it approaches the expected values pretty quickly.
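The text does not show how the deviation was computed, so the following is only a sketch of one plausible calculation. It assumes that MAD here means the average, over the four stages, of the absolute difference between the observed proportions at each time step and ssd; the variable names are my own.

```r
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE)
N <- matrix(0, nrow = 4, ncol = 11)
N[, 1] <- c(12, 34, 21, 12)
for (t in 1:10) N[, t + 1] <- A %*% N[, t]

# Predicted stable stage distribution from the first eigenvector.
ssd <- Re(eigen(A)$vectors[, 1])
ssd <- ssd / sum(ssd)

# Observed proportions in each stage at every time step (columns sum to 1).
props <- sweep(N, 2, colSums(N), "/")

# ssd is recycled down each column, so every column is compared with it.
madByTime <- colMeans(abs(props - ssd))
round(madByTime, 4)
```

Plotting madByTime against 0:10 with plot(..., type="l") gives a curve of the kind described for the figure.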


Figure 8.7: Differences in estimated proportions of individuals in each stage from what was expected through time.

8.3 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• %*% Binary operator to perform matrix multiplication. An example would be X %*% Y.

• as.matrix(x) Coerces the variable x into the data type matrix.

• barplot(x) Creates a barplot of the values in x.

• det(x) Calculates, if possible, the determinant of the matrix in x.

• diag(x) Returns the diagonal (i.e., those entries whose row and column indices are equal) of the matrix in x.

• dim(x) Returns the dimensions of the matrix x (i.e., the number of rows and columns).

• eigen(x) Returns the eigenvalue/eigenvector pairs for the matrix in x as a list. Values are sorted in decreasing order (by modulus when they are complex) and vectors are scaled to unit length.

• ginv(x) Attempts to calculate the generalized inverse of x (this function is in the MASS library).

• legend(x,y,c) Creates a legend for the plot at the coordinates (x, y) with the entries in c.

• matrix(x) Creates a new instance of the matrix data type from the values in x. You will probably need to specify nrow and ncol to set the proper size for the matrices.

• read.table(x) Reads the file x into memory. See ?read.table for the copious amounts of additional parameters that may be needed, as well as Chapter 3.

• t(x) Returns the transpose of the matrix in x (i.e., reverses the row and column indices).


8.4 Exercises

The following exercises are meant to help you understand the items presented in this chapter.

1. In considering the instantaneous growth rate r, it was mentioned that $\lambda_1 > 0$ and this is what you will find in most cases. However, it is possible to get values of λ < 0. For the following values of λ make a graph of t vs. $\lambda^t$ as shown in Figure 8.3 and describe the behavior of the population if these were the real values of r.

(a) $-1 < \lambda_1 < 0$.

(b) $\lambda_1 < -1$.

2. Create a matrix of random numbers using the runif() function and make a barplot of the values. What happens when you pass the optional argument beside=T?

3. Standardize the columns of data in the matrix from the previous exercise so that the sum of each column is equal to 1. Replot this using the function barplot() as done for Figure 8.5 with the beside=F option. How does standardizing each column influence the display of the plot?


Chapter 9

Working With Strings

While the majority of biological data is numeric in nature, there are still several important reasons to be able to manipulate character-based information. For example, you may be downloading all the references from an online database such as WebOfScience and want to mine the abstracts for metadata. You may also be interested in working with sequence data, which consists mostly of text information. In this relatively short chapter we will learn how to work with string data in R and look at a few examples using genetic sequences.

In this chapter, you will focus on the following topics:

• Learn how to work with string data to perform tasks such as parsing, searching,and replacement.

• Learn how to access sequence based data and pre-process it for importation into R

• Learn how to create genetic distance matrices.

• Construct Neighbor-Joining trees and display them in R

9.1 Parsing Text Data

At the most basic level you need to understand that character data in R is treated as a single token in the same way that integer and numeric data are treated. For example, consider the following code:

> x <- c("bob","mary","johnathan")
> length(x)
[1] 3
> x <- "George Stephen Sr."
> length(x)
[1] 1
> x <- c(1,2,3)
> length(x)
[1] 3
> x <- 3
> length(x)
[1] 1


9.1.1 Finding Lengths of Character Sequences

So R treats a character data type, independent of the length of the items in the variable, as a single entry. Once we understand this, the rest of this chapter really begins to take shape and make sense.

So, if R thinks that everything between a pair of quotes is a single instance of a character data type, then how do we figure out how many letters are contained between the quotes? The answer here is the function nchar().

> x <- "George Stephen Sr."
> nchar(x)
[1] 18

Another commonly used function for dealing with strings is the strsplit() function. This function takes the string of characters that you are interested in splitting as well as the character you want to split it on, and returns the chunks as a list. This returning-as-a-list behavior is kind of a pain in the butt, so at the same time I introduce this function I will also show the unlist() function.[1]

> partsOfName <- unlist(strsplit(x, " "))
> partsOfName
[1] "George"  "Stephen" "Sr."
> nchar(partsOfName)
[1] 6 7 3
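The reason strsplit() returns a list becomes clearer when it is given more than one string. The following sketch uses a second, made-up name purely for illustration; sapply() is a base R function that applies another function to every element of a list.

```r
x <- "George Stephen Sr."

parts <- strsplit(x, " ")
class(parts)      # "list", even though only one string was split
parts[[1]]        # the pieces of that one string

# With several strings, each one gets its own list element, and the
# elements may have different lengths (the second name is invented).
fullNames  <- c("George Stephen Sr.", "Mary Ann")
pieces     <- strsplit(fullNames, " ")
wordCounts <- sapply(pieces, length)
wordCounts        # 3 2
```

Calling unlist() on pieces would merge all five words into one vector, which is convenient for a single string but loses track of which name each word came from.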

Here is another example of how we may go about cycling through a set of words in a phrase and doing some operation on them. The first sentence from the first chapter of Darwin’s The Origin Of Species is, "WHEN we look to the individuals of the same variety or sub-variety of our older cultivated plants and animals, one of the first points which strikes us, is, that they generally differ much more from each other, than do the individuals of any one species or variety in a state of nature." While this is a very interesting sentence, we are going to use it to show you how to break down the sentence into an array of words and then tally the number of times each word is used.

We begin by making the sentence all lowercase and without punctuation, because the simple matching procedure would consider "When" different from "when", and the strsplit() function will cut up the string on the spaces (that is what I will tell it to do).

> phrase <- paste("when we look to the individuals of the same variety or sub-variety of our older ",
+   "cultivated plants and animals one of the first points which strikes us is that they ",
+   "generally differ much more from each other than do the individuals of any one species or ",
+   "variety in a state of nature", sep="")
> wordList <- unlist(strsplit(phrase, " "))
> table(wordList)
wordList
          a         and     animals         any  cultivated      differ
          1           1           1           1           1           1
         do        each       first        from   generally          in
          1           1           1           1           1           1
individuals          is        look        more        much      nature
          2           1           1           1           1           1
         of       older         one          or       other         our
          5           1           2           2           1           1
     plants      points        same     species       state     strikes
          1           1           1           1           1           1
sub-variety        than        that         the        they          to
          1           1           1           4           1           1
         us     variety          we        when       which
          1           2           1           1           1

[1] This function takes a list and turns the items in it into a vector, which is easier to work with.
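A table in alphabetical order is hard to scan for the most common words. As a small sketch building on the tally above, sort() can order the counts and a logical test can pull out only the repeated words; both are base R and neither is part of the original example.

```r
phrase <- paste("when we look to the individuals of the same variety or sub-variety of our older",
                "cultivated plants and animals one of the first points which strikes us is that they",
                "generally differ much more from each other than do the individuals of any one species or",
                "variety in a state of nature")

wordList <- unlist(strsplit(phrase, " "))
counts   <- table(wordList)

# Most frequent words first.
sorted <- sort(counts, decreasing = TRUE)
head(sorted, 2)                   # "of" (5 times) and "the" (4 times)

# Only the words that occur more than once.
repeated <- counts[counts > 1]
repeated
```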

9.1.2 Extracting Substrings

It is not possible to use the normal subscripting approaches to access the individual characters within strings because R treats the entire sequence of characters between the quotation marks as a single item. However, you can extract internal components of a string by using the substring() function.

> phrase <- paste("A Goat, that was sitting next to the gentleman in white, shut his eyes and said ",
+   "in a loud voice, 'She ought to know her way to the ticket-office, even if she doesn't know ",
+   "her alphabet !'", sep="")
> substring(phrase, 34, 70)
[1] "the gentleman in white, shut his eyes"
> substring(phrase, 98)
[1] "'She ought to know her way to the ticket-office, even if she doesn't know her alphabet !'"

The function takes the string to be searched and the starting and ending locations in the string, and returns the characters in between. If you do not provide an ending number, it will return all the characters up to the end. This is a shorthand way of saying substring(phrase, x, nchar(phrase)).

It is also possible to use vector notation in pulling out substrings by passing vectors to the start and end arguments.

> startPositions <- c(34,3,58,172,67)
> endPositions <- c(36,6,61,174,70)
> substring(phrase, startPositions, endPositions)
[1] "the"  "Goat" "shut" "her"  "eyes"
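The vectorized form is handy for cutting a string into equal-sized pieces. The sketch below splits the first 18 bases of the example sequence record shown later in this chapter into three-letter chunks, as one might when looking at codons; the reading frame is arbitrary and chosen only for illustration.

```r
dna <- "ctaaaatcaacacttccg"    # first 18 bases of the example record

# Start positions 1, 4, 7, ... and the matching end positions.
starts <- seq(1, nchar(dna), by = 3)
chunks <- substring(dna, starts, starts + 2)
chunks

# The same idea with a chunk size of one gives the individual letters.
letters1 <- substring(dna, 1:nchar(dna), 1:nchar(dna))
table(letters1)
```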

9.1.3 Concatenating Strings

Vectors of character data can be concatenated to form a single long string. This is very helpful in creating labels for graphs that have to include the value of a variable, and for times when you need to open a lot of data files that have a predictable file naming scheme. In R string concatenation is accomplished using the paste() function.

> stringVector <- substring(phrase, startPositions, endPositions)
> stringVector
[1] "the"  "Goat" "shut" "her"  "eyes"
> paste(stringVector, collapse=" ")
[1] "the Goat shut her eyes"
> paste(stringVector, collapse="|")
[1] "the|Goat|shut|her|eyes"
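The two uses mentioned above, file names and graph labels, rely on the sep argument rather than collapse. In this sketch the file naming scheme ("population_1.csv" and so on) is hypothetical, and the growth rate is simply the value estimated in Chapter 8.

```r
# A predictable set of file names; sep = "" joins the pieces with nothing
# in between, and the vector 1:3 produces one name per number.
fileNames <- paste("population_", 1:3, ".csv", sep = "")
fileNames

# A plot label that includes the value of a variable.
lambda1 <- 1.2075472
label   <- paste("lambda =", round(lambda1, 2))
label
```

The difference between the two arguments is that sep controls what goes between the arguments of a single paste() call, whereas collapse squeezes a resulting vector down to one string.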


9.1.4 Matching & Substitution

The final tasks we will look into in this section on string operations are matching and substitution. There are a lot of times when it is useful to be able to see if a particular set of strings has a specific substring within it. This is the realm of matching and is primarily accomplished by the functions grep() and regexpr(). This last function allows you to use what are called Regular Expressions (RE) to scan through a string. While this is a very powerful method for pattern matching, and is something you should know if you are going to do any extensive work with strings, I am not going to cover it in this chapter. In fact, it probably needs its own chapter, and perhaps in a future version of this text I will include it. For those of you who work with string data on a regular basis, look up the regexpr function and have at it; it will make your life easier. For the rest of us, let’s dig into grep for a little light matching exercise.

The grep function takes a pattern that you are looking for and a string that you want to look into. A simple example would be:

> x <- "The quick brown fox jumped over the candle stick"
> grep("fox", x)
[1] 1
> any(grep("fox", x))
[1] TRUE
> any(grep("o", x))
[1] TRUE
> any(grep("dog", x))
[1] FALSE

In general, the grep function returns the indices of the elements in which the pattern was found, and an empty integer vector if it was not found anywhere. Here x holds a single string, so a match returns 1. I wrapped the grep function inside the any() function because it will take either a single argument or a vector of arguments and return a single logical value.
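What grep() returns is easier to see with a vector of several strings. This sketch uses a few species names that appear later in this chapter; value=TRUE is an argument of grep(), and grepl() is a related base R function that returns one logical value per element.

```r
species <- c("Pinus taeda", "Abies alba", "Pinus echinata", "Larix potaninii")

hits <- grep("Pinus", species)
hits                                   # positions of the matching elements: 1 3

grep("Pinus", species, value = TRUE)   # the matching strings themselves
grepl("Pinus", species)                # TRUE FALSE TRUE FALSE

# No match returns an empty vector, so its length can be tested directly.
length(grep("Quercus", species)) > 0   # FALSE
```

The positions returned by grep() can be used as subscripts, so species[hits] gives the same result as the value=TRUE call.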

It is also possible to substitute values in a string with new items. There are two functions that perform string substitutions, sub and gsub. Both of these functions take at least three arguments:

1. A pattern to match,

2. The string to replace the matched pattern with, and

3. The string to search within.

The sub function replaces the first occurrence of the pattern whereas gsub replaces all of them (the g stands for global).

> x <- "The quick brown fox jumped over the candle stick with all the kings men."
> sub("the", "THE", x)
[1] "The quick brown fox jumped over THE candle stick with all the kings men."
> gsub("the", "THE", x)
[1] "The quick brown fox jumped over THE candle stick with all THE kings men."
> gsub("the", "THE", x, ignore.case=T)
[1] "THE quick brown fox jumped over THE candle stick with all THE kings men."

Both of these functions have optional arguments, the most common of which is the ignore.case option, which controls whether the searching and replacing takes the case of the letters into consideration when matching.
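Substitution is also a way to do the clean-up that was done by hand for the Darwin sentence earlier in this chapter. The following sketch assumes that only commas and periods need to be removed (which is true for this sentence); the pattern "[,.]" is a very small regular expression meaning "a comma or a period", and tolower() is a base R function.

```r
raw <- paste("WHEN we look to the individuals of the same variety or sub-variety of our older",
             "cultivated plants and animals, one of the first points which strikes us, is, that they",
             "generally differ much more from each other, than do the individuals of any one species or",
             "variety in a state of nature.")

# Replace every comma or period with nothing, then lowercase everything.
noPunct <- gsub("[,.]", "", raw)
clean   <- tolower(noPunct)

words <- unlist(strsplit(clean, " "))
head(words)
length(words)
```

The hyphen in "sub-variety" is left alone because it is not in the pattern, so the word is still counted as a single token.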

9.1.5 Slightly More In Depth Examples: Genetic Sequence Analyses

Genetic sequences are essentially long character strings and R has a few different libraries available to you for the analysis of sequence data. I am not going to get into what a genetic sequence is; if you do not already know about it then you probably should not be calling yourself a biologist... In this section, we will:

1. Briefly discuss how we go about getting DNA sequence data

2. Learn how to align sequences

3. Import aligned sequence data into R

4. Create a distance matrix from the sequences

5. Use R to estimate a Neighbor-Joining tree from the sequence data

Getting DNA Sequence Data

The mother of all sequence repositories that you can access (without actually doing the sequencing yourself) is the NCBI web database located at http://www.ncbi.nlm.nih.gov/. Here you can run database queries based upon taxa, genes, groups, or whatever. The basic results of a search are given as an annotation (just below). This annotation has three parts:

1. The metadata in the top section, which contains the locus definition, size, who found it, references, and the taxonomy of the organism.

2. The "FEATURES" of the record, which describe what is in the sequences (coding and non-coding regions if known), some geographical and taxonomic information that has been standardized (good for data mining and putting on a map), as well as the translation of the genetic sequence into amino acids if appropriate.

3. The "ORIGIN", which contains the raw sequence information.

An example of a record is given below:

LOCUS       FJ347583                 278 bp    DNA     linear   INV 01-JUL-2009
DEFINITION  Araptus attenuatus haplotype 5 muscle protein 20 (MP20) gene,
            partial sequence.
ACCESSION   FJ347583
VERSION     FJ347583.1  GI:227345175
KEYWORDS    .
SOURCE      Araptus attenuatus
  ORGANISM  Araptus attenuatus
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Coleoptera; Polyphaga; Cucujiformia;
            Curculionidae; Scolytinae; Araptus.
REFERENCE   1  (bases 1 to 278)
  AUTHORS   Garrick,R.C., Meadows,C.A., Nason,J.D., Cognato,A.I. and Dyer,R.J.
  TITLE     Variable nuclear markers for a Sonoran Desert bark beetle, Araptus
            attenuatus Wood (Curculionidae: Scolytinae), with applications to
            related genera
  JOURNAL   Conserv. Genet. 10 (4), 1177-1179 (2009)
REFERENCE   2  (bases 1 to 278)
  AUTHORS   Garrick,R.C., Meadows,C.A., Nason,J.D., Cognato,A.I. and Dyer,R.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-SEP-2008) Department of Biology, Virginia
            Commonwealth University, 1000 West Cary Street, Richmond, VA 23284,
            USA
FEATURES             Location/Qualifiers
     source          1..278
                     /organism="Araptus attenuatus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:634056"
                     /haplotype="5"
     gene            <1..>278
                     /gene="MP20"
                     /note="muscle protein 20; coding region not determined"
ORIGIN
        1 ctaaaatcaa cacttccgga ggacaattta aattcatgga aaacatcaac aagtaagaaa
       61 aaaataattt gacatgtaaa taatgtagag aaaattcata aacattccta ttttttattg
      121 atttgtcaat atttagtttg gaactaaact ctgacaatca attatacagg gtgacaattc
      181 taattacatt tccattcaat gccaactaga aatttcgtga aaaaaaaatt gtttctatgc
      241 caaacatact gttttataag atttaattcc agaaattt
//
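The string functions from earlier in this chapter are enough to turn the ORIGIN block of such a record into one continuous sequence. This is a sketch under the assumption that the five sequence lines have already been read into a character vector (for example with readLines()); here they are simply typed in.

```r
originLines <- c(
  "        1 ctaaaatcaa cacttccgga ggacaattta aattcatgga aaacatcaac aagtaagaaa",
  "       61 aaaataattt gacatgtaaa taatgtagag aaaattcata aacattccta ttttttattg",
  "      121 atttgtcaat atttagtttg gaactaaact ctgacaatca attatacagg gtgacaattc",
  "      181 taattacatt tccattcaat gccaactaga aatttcgtga aaaaaaaatt gtttctatgc",
  "      241 caaacatact gttttataag atttaattcc agaaattt")

# Remove everything that is not a base (position numbers and spaces),
# then glue the five lines together.
sequence <- paste(gsub("[^acgt]", "", originLines), collapse = "")

nchar(sequence)                       # should agree with the 278 bp in LOCUS
table(unlist(strsplit(sequence, ""))) # base composition
```

Splitting on the empty string "" breaks the sequence into single letters, which table() can then count.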

Sequence Formats & Aligning Genetic Sequences

The format of sequence data like this is a bit verbose but very informative. When we work with sequence data we will use an abbreviated file format, the FASTA format. This format is very compact and, as a result, rather easy to use. In general, FASTA files are simple text files that have a block of information for each sequence. Each block contains a summary line that must begin with the greater-than character (>) and can otherwise be anything you like. It is common to put the accession numbers, locus identifier, taxonomy, and other information into this line. The lines following the summary line are the raw sequence. If you want to have more than a single taxon in a file, you just put the next taxon block below the previous one and continue. In general they look like this (this is an excerpt from an example data set that you have in the class folder):

>Pinus caribaea var. hondurensis
GGTTCAAGTCCCTCTATCCCCACCCAGGTTCGGTCCCGAAAGGAYTGATCTATCTTCTCCAATTCCATTGGTTCGAATCCATTCTAATTTCTCGATTCTTTTACCTCGCTATTTTTTTTTTTTCATGAAGAGAAGAAATTAGAACATGAATCTTTTCATCCATCTTATGACAAGTTGAGTTGATCTGTTAATAAGTTGATCATATGATCAATTTATTTTGTGATATATGATCTACATAGAATAGATTAGATCNTTTTTAAATTATTCAATTGCAGTCCATTTTTATCATATTAGTGACTTCCAGATCGAAAATAATAAAGATCATTCTAAAAACTAGTAAAAATACCTTTTTACTTCTTTTTAGTTGACACAAGTTAAAACCCTGTACCAGGATGATCCACAGGGAAGAGCCGGGGATAGCTCATTTGGTAAACCAAAGGACTGAAAATCCTCGTGTCACCAGTTCAAAT

>Pinus echinata
ACCCAGGTTCGTTCCCGAACGGATTGATCTATCTTCTCCAATTCCATTGGTTCGAATCCATTCTAATTTCTCGATTCTTTTACCTCGCTATTTTTTTTTTTCATGAAGAGAAGAAATTAGAACATGAATCTTTTCATCCATCTTATGACAAGTTGAGTTGATCTGTTAATAAGTTGATCATATGATCAATTTATTTTGTGATATATGATCTACATAGAATAGATTAGATCATTTTTAAATTATTCAATTGCAGTCCATTTTTATCATATTAGTGACTTCCAGATCGAAAATAATAAAGATCATTCTAAAAACTAGTAAAAATACCTTTTTACTTCTTTTTAGTTGACACAAGTTAAAACCCTGTACCAGGATGATCCACAGGGAAGAGCCGGGATAGCTCAGTTGGTAGAGCAGAGGACTGAAAATC
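Because a FASTA file is just text, its block structure can be pulled apart with the string tools from this chapter. The sketch below is a base R illustration only, not the approach used in the rest of this section (which relies on a library): it assumes the lines have already been read into a character vector, and the sequences are truncated to their first few bases to keep the example short. tapply() and cumsum() are base R functions.

```r
# Lines as readLines() would return them (sequences shortened for illustration).
fastaLines <- c(">Pinus caribaea var. hondurensis",
                "GGTTCAAGTCCCTCTATCCC",
                "CACCCAGGTTCGGTCCCGAA",
                ">Pinus echinata",
                "ACCCAGGTTCGTTCCCGAAC")

isHeader <- grepl("^>", fastaLines)             # summary lines start with >
labels   <- sub("^>", "", fastaLines[isHeader]) # drop the leading >

# Number each block: the counter goes up by one at every summary line.
block <- cumsum(isHeader)

# Paste together the sequence lines that belong to the same block.
seqs <- as.vector(tapply(fastaLines[!isHeader], block[!isHeader],
                         paste, collapse = ""))
names(seqs) <- labels
nchar(seqs)
```

The same steps work unchanged when a sequence is wrapped over many lines, since every non-summary line is assigned to the block of the summary line above it.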

When conducting analyses of genetic sequence data, it is important that you are confident that all the sequences you have are of homologous portions of the genome. For the example I used here, I downloaded some genetic sequence data for a handful of conifers in the family Pinaceae from the NCBI website. The sequences I was looking for are from a common inter-genic spacer region between the genes encoding for tRNA-trnL and tRNA-trnF. These sequences were between 390 and 470 base pairs in length and are in the file named confiers.fasta in the folder for this chapter. I cleaned up the summary lines in this file so it only has the genus and species names rather than all the other stuff. This makes it a bit easier for you in the future when you interact with the data.

Before I played with these sequences, I ran an alignment on them to make sure we were dealing with the matching sequences across taxa. There are many ways to do this, and I just used the online ClustalW server at http://align.genome.jp to align the sequences for me. This is not something you want to do by hand, and it is much better to let a computer do some of the work for you. This algorithm aligns all the sequences and returns the file in a clustal format. This is another text file, but this time all the species are displayed in blocks with homologous sequence locations in the same text column. An example of this is shown below with gaps (insertions/deletions) indicated by the dash character (-).

Pinus caribaea var. hondurensi  CC---CACCCAGG-TTCGGTCCCGAAAGGAYTGATCTATCTTCTCCAATT
Pinus taeda                     ------ACCCAGG-TTCGTTCCCGAACGGATTGATCTATCTTCTCCAATT
Pinus ponderosa                 ------ACCCAGG-TTCGTTCCCGAACGGATTGATCTATCTTCTCCAATT
Pinus echinata                  ------ACCCAGG-TTCGTTCCCGAACGGATTGATCTATCTTCTCCAATT

This file is also located in the folder for this chapter and is called conifers.aln; this is the file we will be working with.

Getting Aligned Sequences Into R

R does not by default recognize sequence data as anything more elegant than a sequence of characters. As a result, several people have developed libraries for you to use that have a lot of general functionality to them. In this section, I am going to use the library ape. If you do not have this library installed on your machine, see Appendix B for an overview of the process.

I am assuming that you currently have the data file in a location that you can reach easily from within R. To load the aligned sequences into R, type the following:

> library(ape)
> seqs <- read.dna("confiers.aln", format="clustal")
> class(seqs)
[1] "DNAbin"
> summary(seqs)
23 DNA sequences in binary format stored in a matrix.

All sequences of same length: 526

Labels: Abies alba Abies kawakamii Abies veitchii Abies homolepis Larix potaninii Cedrus atlantica ...

Base composition:
    a     c     g     t
0.310 0.187 0.160 0.343


There are several things that you can do with these aligned sequences. You can look for motifs, examine GC content, etc. I will leave these options for you to play with later in the exercises.

Constructing A Neighbor Joining Tree

To construct a Neighbor Joining (NJ) tree, we first need to create a distance matrix that estimates the distances between pairs of sequences that we have in our file. There are several different kinds of distance metrics that you can use in the calculation of this distance matrix (see ?dist.dna for more information on these). We will use the default, which is Kimura's 2-parameter model, called "K80".

> D <- dist.dna(seqs)
> class(D)
[1] "dist"
> summary(D)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00000 0.07252 0.09310 0.26890 0.15720 1.45700

The function dist.dna() takes as an argument a set of sequences that you have read in (they must be of class DNAbin as shown above) and spits out the distance matrix. The distance matrix, D, is a particular kind of matrix that holds the lower triangle of the pair-wise distance calculations. If you print it out, you will get a whole lot of output as it prints the taxa names for row and column headers.

Since D is a general distance matrix, we can look at the values in it. Figure 9.1 shows a histogram of the distance values that have been estimated in D. From this we see that many values are low, meaning that those sequences are very similar to each other, and that there are 2-3 peaks at larger values, suggesting some degree of sequence divergence.
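The lower-triangle storage of a dist object is easier to see on a tiny example. This sketch uses the base R function dist() on made-up numbers as a stand-in for dist.dna(), relying only on the fact shown above that both return an object of class dist; the taxon names are hypothetical.

```r
# Four made-up "taxa" measured on two made-up variables.
m <- matrix(c(0, 0,
              3, 4,
              6, 8,
              0, 1), ncol = 2, byrow = TRUE)
rownames(m) <- c("taxonA", "taxonB", "taxonC", "taxonD")

d <- dist(m)            # Euclidean distances, class "dist"
print(class(d))
print(length(d))        # only the n*(n-1)/2 lower-triangle values are stored

full <- as.matrix(d)    # expand to a full symmetric matrix when needed
print(full["taxonA", "taxonB"])

# A histogram of the pairwise values, in the spirit of Figure 9.1.
hist(as.vector(d), xlab = "Distance", main = "Pairwise distances")
```

Converting with as.matrix() is handy when you want to look up one particular pair by name rather than print the whole object.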

To create an NJ tree from these distances, we use the function nj().

> njTree <- nj(D)
> class(njTree)
[1] "phylo"
> summary(njTree)

Phylogenetic tree: njTree

  Number of tips: 23
  Number of nodes: 21
  Branch lengths:
    mean: 0.03838704
    variance: 0.01999758
    distribution summary:
      Min.    1st Qu.     Median    3rd Qu.       Max.
-0.0009736  0.0000000  0.0004898  0.0150700  0.8610000
  No root edge.
  First ten tip labels: Abies alba
                        Abies kawakamii
                        Abies veitchii
                        Abies homolepis
                        Larix potaninii
                        Cedrus atlantica
                        Larix decidua
                        Cedrus deodara
                        Larix laricina
                        Pinus roxburghii
  No node labels.

Figure 9.1: Histogram of distance estimates among all sequences using the "K80" model of substitutions.

This function takes a distance matrix and returns a tree of class phylo. We can see that the variable njTree holds some internal information that may be of interest (e.g., branch lengths), but the real way to understand it is to look at a graphic of the tree. To do this, we use the plot() command and pass it the njTree variable, as plot(njTree). [2]
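The way plot() finds the right routine for a phylo object is ordinary class-based dispatch, which a toy class can illustrate. In this sketch the class name leafCount and its plot method are invented for illustration; they are not part of ape or of base R.

```r
# A toy object tagged with a made-up class name.
counts <- structure(list(values = c(12, 7, 15),
                         labels = c("Abies", "Larix", "Pinus")),
                    class = "leafCount")

# Because the function is named plot.<class>, calling plot(counts)
# is routed here, just as plot(njTree) is routed to plot.phylo().
plot.leafCount <- function(x, ...) {
  barplot(x$values, names.arg = x$labels, ...)
  invisible(x)
}

plot(counts, main = "Dispatched to plot.leafCount")

# The class is what drives the choice of method.
print(class(counts))
```

This is also why the help page you want for tree plotting options is ?plot.phylo rather than ?plot.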

The topology of the tree (Figure 9.2) is easy to interpret, and it is quite obvious where those very large distances shown in Figure 9.1 come from. From this topology we can see that:

1. The Pinus species are generally together, forming a polytomy that connects to the other genera in the family.

2. The Larix, Abies, and Cedrus form generally self-contained groups.

3. The most divergent groups are the Picea and Keteleeria samples.

There is quite a bit more that can be done here, but I think that is enough to get you on the right track if you are interested in using R for some basic sequence analysis.

Figure 9.2: Neighbor joining tree based upon the trnL-trnF intergenic spacer sequences and the "K80" model of sequence evolution.

[2] You may be surprised by the utility of the plot function, as it seems to know how to plot everything. In actuality this function is simply a wrapper that takes whatever you pass to it and determines whether the class of the object you passed has its own plot command. For the tree, the native command is plot.phylo(), and you have to look up that command to see the available options for it.

9.2 Producing Formatted Output

Often in the use of R there is a need to produce a particular kind of output from an analysis or to display the contents of a particular variable. R does a pretty good job itself, but it has some limitations. For example, you may want to print out a matrix of values but only have 2 decimal places printed for each entry. Or you may want to export a table of values as HTML so that you can copy and paste it into another program.


9.2.1 Formatting Strings For Printing

format(x, trim = FALSE, digits = NULL, nsmall = 0,
       justify = c("left", "right", "centre", "none"),
       width = NULL, na.encode = TRUE, scientific = NA,
       big.mark = "", big.interval = 3,
       small.mark = "", small.interval = 5,
       decimal.mark = ".", zero.print = NULL, drop0trailing = FALSE, ...)
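The signature above is the full argument list of format(). As a rough guide, here is a short sketch of the arguments I would expect to be used most often for printing; the numbers and labels are made up.

```r
x <- c(0.1678067, -1.0302831, 12.5)

# nsmall sets a minimum number of decimals; rounding first caps the maximum.
twoDecimals <- format(round(x, 2), nsmall = 2)
print(twoDecimals)

# width pads each string; justify applies to character vectors.
padded <- format(c("Row 1", "Row 22"), width = 8, justify = "right")
print(padded)

# big.mark inserts a thousands separator.
withCommas <- format(1234567, big.mark = ",")
print(withCommas)

# format() always returns character strings, so use it only for display.
print(class(twoDecimals))
```

Because the result is text, do any arithmetic on the original numbers first and format them only at the very end.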

9.2.2 Formatting Tables

Tabular data is a common kind of output to move into another format. Tables are common features of statistical analysis, and as such you will find it necessary to cut a table out of R and paste it into a document in the same way that graphics can be exported from R to be used in your manuscripts and reports.

For these examples, I will just create a matrix of values and add row and column names using the functions rownames and colnames. The table itself is then built with the xtable() function from the xtable library.

> x <- matrix(rnorm(12), nrow=3)
> x
           [,1]      [,2]       [,3]       [,4]
[1,]  0.1678067 0.8856766 -0.3955881  0.7677516
[2,] -1.0302831 0.7392326 -0.8333904 -0.3235135
[3,]  0.4396607 1.7622323 -0.8763023  0.6091688
> colnames(x) <- c("Header A", "Header B", "Header C", "Header D")
> rownames(x) <- c("Row 1", "Row 2", "Row 3")
> x
        Header A  Header B   Header C   Header D
Row 1  0.1678067 0.8856766 -0.3955881  0.7677516
Row 2 -1.0302831 0.7392326 -0.8333904 -0.3235135
Row 3  0.4396607 1.7622323 -0.8763023  0.6091688
> library(xtable)
> theMatrixTable <- xtable(x, caption="Caption For Table", align="l|cccc")

The variable theMatrixTable is now an xtable object. What we do with it at this point depends upon how you want to interact with it.

Getting LaTeX Output

If you print it out as is, it will display the contents in LaTeX, a typesetting language that is used to create very nice looking manuscripts and books (this entire book has been written in it). If you use LaTeX to write your manuscripts then you are set. The listing that follows shows the formatting that results, and Table 9.1 is what it looks like when it is inserted into a LaTeX document.

% latex table generated in R 2.8.0 by xtable 1.5-4 package
% Wed Dec 31 14:22:46 2008
\begin{table}[ht]
\begin{center}
\caption{Caption For Table}
\begin{tabular}{l|cccc}
\hline
 & Header A & Header B & Header C & Header D \\
\hline
Row 1 & 0.17 & 0.89 & -0.40 & 0.77 \\
Row 2 & -1.03 & 0.74 & -0.83 & -0.32 \\
Row 3 & 0.44 & 1.76 & -0.88 & 0.61 \\
\hline
\end{tabular}
\end{center}
\end{table}

Table 9.1: Caption For Table

         Header A   Header B   Header C   Header D
Row 1      0.17       0.89      -0.40       0.77
Row 2     -1.03       0.74      -0.83      -0.32
Row 3      0.44       1.76      -0.88       0.61

You can also print the table to a file by calling the function print(theMatrixTable, file="thefileName.tex"). There are several other options available to you with the print function; see ?print.xtable for more information.

Exporting In HTML for Web or Word

If you do not use LaTeX and are a biologist who does a lot of mathematical, programming, or scientific work, then you should. That being said, there are many people for whom a general overpriced and underpowered word processor (which shall remain nameless but is buggy and prone to viruses and screwing up your manuscripts, you know which one I mean) is the best they can expect to master. The xtable can be exported into a format you can open in said program by exporting the file as type="html". To do so, call the command

> print(theMatrixTable, type="html", file="MyHTMLizedTable.html")

and the table will be saved. You can then open it in your favorite word processor, and it will turn the html table into a normal table that you can manipulate in your documents. An example of the html markup that this function produces is given below and an image of it is presented in Figure 9.3.

<!-- html table generated in R 2.8.0 by xtable 1.5-4 package -->
<!-- Wed Dec 31 14:22:51 2008 -->
<TABLE border=1>
<CAPTION ALIGN="bottom"> Caption For Table </CAPTION>
<TR> <TH> </TH> <TH> Header A </TH> <TH> Header B </TH> <TH> Header C </TH> <TH> Header D </TH> </TR>
<TR> <TD> Row 1 </TD> <TD align="center"> 0.17 </TD> <TD align="center"> 0.89 </TD> <TD align="center"> -0.40 </TD> <TD align="center"> 0.77 </TD> </TR>
<TR> <TD> Row 2 </TD> <TD align="center"> -1.03 </TD> <TD align="center"> 0.74 </TD> <TD align="center"> -0.83 </TD> <TD align="center"> -0.32 </TD> </TR>
<TR> <TD> Row 3 </TD> <TD align="center"> 0.44 </TD> <TD align="center"> 1.76 </TD> <TD align="center"> -0.88 </TD> <TD align="center"> 0.61 </TD> </TR>
</TABLE>

The HTML above produces a table that, when opened in Firefox, looks like the one presented in Figure 9.3.

Figure 9.3: The html printout of an xtable as interpreted in Firefox. You can also import tables saved as html into popular word processors and use them as normal table items in the creation of your documents.

There are several other options available to you with the print function; see ?print.xtable for more information.
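To tie the two export routes together, here is a minimal sketch that writes the same table as both LaTeX and HTML. It assumes the xtable library is installed (the code is skipped quietly if it is not), uses a small made-up matrix, and writes to temporary files that simply stand in for whatever file names you would choose.

```r
# Assumes the xtable library is installed; skipped quietly otherwise.
if (requireNamespace("xtable", quietly = TRUE)) {
  x <- matrix(c(0.1678, -1.0303, 0.4397, 0.8857), nrow = 2,
              dimnames = list(c("Row 1", "Row 2"),
                              c("Header A", "Header B")))

  # digits = 2 asks for two decimal places in every numeric column.
  tab <- xtable::xtable(x, caption = "Caption For Table", digits = 2)

  # Temporary files stand in for names such as "MyHTMLizedTable.html".
  texFile  <- tempfile(fileext = ".tex")
  htmlFile <- tempfile(fileext = ".html")
  print(tab, type = "latex", file = texFile)
  print(tab, type = "html",  file = htmlFile)

  # Peek at the first few lines of each file.
  print(head(readLines(texFile), 4))
  print(head(readLines(htmlFile), 4))
}
```

Writing to a file rather than copying from the console avoids picking up stray prompt characters along the way.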

9.3 Plotting Special Characters

There are some special characters that you should be aware of when trying to get your data output into a readable format. These are not characters that you type as visible symbols; rather, they are control characters, namely the tab character, the newline character, and the bell character.
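In R strings these three characters are written as backslash escapes. The sketch below shows how they behave with cat() and print(); the text being printed is made up.

```r
# \t is a tab and \n is a newline; cat() interprets them,
# whereas print() shows them in their escaped form.
line <- "Population\tHeight\nA\t23.4\n"
cat(line)
print(line)

# nchar() counts each escape sequence as a single character.
print(nchar("\t"))

# \a is the bell character; whether you hear anything depends on the terminal.
cat("\a")
```

This difference between cat() and print() is worth remembering whenever you are building output that is meant to be read by people or pasted into a spreadsheet.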

All the characters on your keyboard (assuming that you are using an en_US keyboard) are encoded as single values in ASCII (the American Standard Code for Information Interchange). Obviously, since the first A stands for American, there are a lot of characters that you see on a computer screen that you cannot type directly on a keyboard, such as letters with accents, Greek characters (α, Λ, Ω), and all those non-US English characters and hieroglyphs. The terminal that you are running R from may not be able to handle these characters, but you can get them into plots that you make.

R has the nice ability to produce slightly complicated output for the axes of your plots as well as for putting into most graphics you produce. Items such as subscripts, superscripts, and mathematical symbols are easily produced using just a few different functions.


The primary way of producing formatted text for graphics output is through the expression function. The best way to see what R can do for nice math-like output is to look at its own demo. So, start R and type:

> demo(plotmath)

This command will show you a small number of tables in a figure window that have examples of the different kinds of math plotting that R handles. When R sources the demo script it passes the optional echo=TRUE parameter, so all the commands that are used to produce the output are also shown in the R command interface. This way you can see how each of the cells in the displayed tables is being encoded. An example of some of the copious output is:

> draw.plotmath.cell(expression(italic(x)), i, nr); i <- i + 1

> draw.plotmath.cell(expression(bold(x)), i, nr); i <- i + 1

> draw.plotmath.cell(expression(bolditalic(x)), i, nr); i <- i + 1

The demo script itself defines the function draw.plotmath.cell(), so don't worry about that part. The part you should focus on is the expression(bold(x)) part. There are several options that you can pass to the expression function, and it is not quite worth listing them all here since you can see them in the R demo itself. However, I will show some of the more common methods in the plot shown in Figure 9.4.

> x <- rnorm(100)
> y <- 23 + 1.4*x + 2*rnorm(100)
> plot(x, y, bty="n", ylab=expression(X[stuff]), xlab=expression(chi^2), col="red")

For both the x- and y-axes, I use the expression function to create labels with subscripts and superscripts. If you like, you can define these values as individual variables prior to plotting to keep the plot command a bit cleaner; there is really no difference in the speed at which R would evaluate them. Here is another example:

> xlabel <- expression(bold(x[i]))
> ylabel <- expression(italic(x[i]^2))
> plot(x, y, bty="n", xlim=c(0,20), type="l", lwd=2, col="blue", xlab=xlabel, ylab=ylabel)

Look at the demo(plotmath) output to see the diversity of plotting approaches.
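As one more small sketch of the same idea, the code below combines Greek letters, subscripts, and superscripts, and uses bquote() to drop the current value of a variable into a label. The data and the variable name slope are made up for illustration.

```r
x <- seq(0, 10, length.out = 50)
y <- x^2

# Greek letters are written by name, [ ] gives a subscript,
# ^ gives a superscript, and paste() joins text with symbols.
plot(x, y, type = "l",
     xlab = expression(alpha[i]),
     ylab = expression(alpha[i]^2),
     main = expression(paste("Mean ", mu, " and variance ", sigma^2)))

# bquote() substitutes the value wrapped in .( ) into the label.
slope <- 1.4
slopeLabel <- bquote(beta == .(slope))
text(2, 80, slopeLabel)
```

The bquote() form is useful when the number in the label comes out of an analysis and you do not want to retype it by hand.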


Figure 9.4: Example of using the expression function to annotate a graphic.

9.4 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• any(x) Returns TRUE if at least one element of the logical vector x is TRUE (e.g., any(x == y) asks whether x has any instance of y in it).

• cat(x) Concatenates the objects in x and dumps them out to the interface.

• format(x) Formats the object x for rigid (some say pretty) printing.

• substring(x,s,f) Takes the string in x and returns the substring starting at position s and finishing at position f.

• strsplit(x,c) Splits the string x on the character (or characters) in c.

• nchar(x) Returns the number of characters in the string x.

• expression(x) This function takes the terms in x and turns them into an expression that a plotting function can draw as formatted text.

• nj(x) This function builds a neighbor joining tree from the distance matrix x.

• unlist(x) Takes the list x and returns it as a vector.
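Several of the string functions listed above are typically used together. The following sketch strings them along on a made-up sentence so you can see what each one returns.

```r
sentence <- "Pinus taeda,Pinus echinata,Abies alba"

print(nchar(sentence))               # number of characters in the string
print(substring(sentence, 1, 5))     # characters 1 through 5

# strsplit() returns a list (one element per input string),
# so unlist() is used to get a plain character vector.
taxa <- unlist(strsplit(sentence, ","))
print(taxa)

# any() asks whether at least one logical value is TRUE.
print(any(taxa == "Abies alba"))

# cat() writes the pieces to the console, separated by sep.
cat(taxa, sep = "\n")
```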


9.5 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Create a table from the data data <- matrix(rnorm(9), nrow=3) and label the rows as c("Richmond", "Petersburg", "Varina") and the columns as c("PPM(A)", "PPM(B)", "PPM(C)"). Use the xtable library to export this table as HTML and then import it into your answers. (This is a very helpful methodology for getting formatted data out of R and into your manuscripts.)

2. Create a table of the different words found on the first page of the Chapter entitled Preface in this text.

3. Use the strsplit function to break apart the raw text of the first four paragraphs of the Chapter entitled Preliminaries into sentences (HINT: use the "." as the character to break apart the string on, with fixed=TRUE so that it is not treated as a regular expression; you can copy and paste the text from the pdf). Then use the grep command to find the sentences that have the word are in them.

4. Show how you would use the sub command to fix the sentence, "Dr. Dyer is a loser". (And when I say "fix" I mean make it say that I am not...)

5. How many characters are in the first paragraph of this Chapter?

6. Create a density plot of the χ2 distribution and make the main label say "χ" using the expression() function (hint: this character is called 'chi').

7. In the previous graph, plot a dotted vertical line to indicate where the mean value of the distribution is and put the "µ" symbol next to it.

8. Use the aligned sequences to create a few different distance matrices by changing the model type that you pass to the function dist.dna(). Do alternate distance models have different densities of values? (Hint: plot a density plot for each distance matrix on the same graph, similar to what is shown in Figure 9.1.)

9. Do these different distance models produce different tree topologies when using the nj() function? If so, show the trees and describe the differences you see in the trees.

10. Do the functions nj(), fastme.bal(), and bionj() produce the same looking topologies? You should read the help for these functions to see what they are, as you probably haven't worked with them yet. Explain.


Part III

Extending R


Chapter 10

Basic Scripts

When I use the term script here, what I am referring to is a set of R commands that you put into a text file and have R evaluate. Learning how to write scripts will help you out in the following ways:

1. In general we are all lazy. It seems to be a monumental task to type the same thing into R over and over again. Scripts allow you to put your commands into a text file and have R run them for you.

2. Keeping your analyses and data sets together is a great way to not lose the record of what you have done. At a later date, you can come back and pick up where you left off. If you have more data or another angle on the analysis, having a record of how the previous analyses were performed is a huge benefit.

3. There are times when you have to do the same thing over and over again, say make graphs of a large number of variables or transform a lot of different data sets using the same algorithm. If you put the commands in a script, and later add programming (Chapter 11) and functions (Chapter 12), you can run it over and over again with ease (remember the lazy thing?).

So in essence, scripts are enablers for our laziness.

In this chapter, you will focus on the following topics:

• Learn about basic script writing

• Understand differences between code evaluated from a script and that same code typed into the interactive R command line

• Execute scripts in R

10.1 Writing Scripts

A script is nothing more than a series of commands that R recognizes and evaluates. Within a script, you can define data (Chapter 2), functions (Chapter 12), or other operations. It is convenient to have a record of the commands that you use in R to produce output.

10.1.1 Knowing Directories

A script must be a plain text file, and it must reside in a location that you can point R to. When you start an interactive session in R, it notes the current directory that you are using. This is what is called the cwd, or current working directory. If you are using R from a GUI-ish installation such as on Windows, you have to tell R which directory to use as a starting place. You can change the cwd from the "Change dir..." command in the "File" menu. If you are starting R from a terminal (in OSX or some Unix variant), then the directory where you started will be the cwd.
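The working directory can also be inspected and changed from the R prompt itself with getwd() and setwd(). In this minimal sketch tempdir() is only a stand-in for the folder that actually holds your data and scripts.

```r
# Where is R currently looking for files?
startDir <- getwd()
print(startDir)

# Move to another directory; tempdir() is only a stand-in for
# the folder that holds your data and scripts.
setwd(tempdir())
print(getwd())
print(list.files())      # files visible from the new working directory

# Return to where we started.
setwd(startDir)
```

Checking getwd() and list.files() first is usually the quickest way to find out why R says it cannot open a file.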

Here are a few tips that I find helpful when I work with R :

• It is a pretty good idea to keep your data sets and the scripts that you use to analyze these data in the same directory. Use your descriptive skills in naming your data and scripts such that you know what is contained in the file without looking at it (e.g., perhaps a data set named DogwoodGerminationRates27.csv and the R script as AnalysisOfDogwoodGermination.R; it just makes things easier).

• It is also a good idea to give each project its own directory of data and associated scripts so that it is easy for you to find the right one. Keeping it all mashed together in a single directory can cause problems with data sets having the same name (e.g., the infamous data.txt).

• Always provide labels for each column of data. At some time in the future you will need to look at the data set and figure out what that column of data represents.

• In your scripts, provide a lot of comments. Lines that start with the hash character (#) are ignored by R, and you can use them for adding comments about what the script, program, functions, or variables actually mean. I cannot emphasize this enough. You will leave this class and at some point in the future look back on some script you wrote and want to figure out how it works, and without copious comments you will fail and have a small sense of being a genuine loser. You have been warned.

10.1.2 The Editor

You can write a script in any basic text editor. For some installations of R, there is a pseudo-GUI associated with it (e.g., Windows) because there is no real command line terminal in the OS. This interface to R often has an integrated editor built into it, and if it is there you should probably use it unless you have another editor of choice that you feel more comfortable with. [1] If you do not want to use the supplied editor or do not have one available, you may want to check out TEXTMATE or TEXTWRANGLER on OSX, E or CRIMSON EDITOR (or the million others that are on this pedestrian platform) on Windows, and for Unix/Linux you can use GEDIT, KATE, EMACS or VI (n.b. If you learn one of these last two you will never need another editor on any platform).

[1] There have literally been decades of wars fought over the choice of the real editor. If you are interested in cultural aspects of programming and programmers (e.g., nerds like myself), fire up a google search for "vi vs. emacs" and sit back and enjoy.

The important component of the editor that you are looking for is one that understands R (or SPlus) and can provide you with syntax highlighting, parenthesis matching, and automatic indentation. These are things that just make your life easier. After all, if you are going to be spending a lot of time in front of your computer, you may as well have tools that help instead of get in the way. Speaking of getting in the way, you should never, under any circumstance, even think of using Word to do any of this.

OK, so open your editor and we will make a very small script that does something entirely useless. There is a data set named ScriptExampleData1.txt in the class folder. Make sure your script is saved in the same directory as the data file. In R type the following code and see what happens.

> theData <- read.table("ScriptExampleData1.txt", header=T, sep=",")
> summary(theData)
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20
> range(theData$Height)
[1] 23.4 38.2
> levels(theData$Population)
[1] "A" "B"

It should have loaded theData and provided a summary of it as shown. If not, you are probably not in the correct directory. Change to the right directory and redo.

Now, take the same code and put it into your script file. Obviously, you do not want to copy the responses that the R engine had provided to you, just the commands that you typed. Save the script as AnalysisOfScriptData.R (note you must have the .R suffix on the script file). Congratulations, you have written your first script. In the next section we will evaluate the script and note a few differences.

10.2 Evaluating Scripts

The R engine can load and evaluate scripts relatively easily. Take a look at the documentation for the source() command by typing ?source into R and give it a read. OK, ready? In R type source("AnalysisOfScriptData.R") and see what happens... Nothing. Why is this? The same commands produced lots of output when typed directly into R...

The issue is that when you are typing commands into R you are doing so in an interactive mode. You say "do this" and it says "OK." However, when you are executing the contents of a script, it is not entirely clear where output should go: another file, the screen, some other place. As a result, if you want to get a response from stuff in a script you need to tell R to print the results. So, for example, if you change your script to look like:


theData <- read.table("ScriptExampleData1.txt", header=T, sep=",")
print(summary(theData))
print(range(theData$Height))
print(levels(theData$Population))

and then source it from R, you will get:

> source("AnalysisOfScriptData.R")
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20
[1] 23.4 38.2
[1] "A" "B"

Again, notice that here the output was only the response to the commands; the commands themselves were not echoed to the R environment. You can get R to echo each command in a script and then provide its results by adding the optional echo=TRUE argument to the source() function, as shown in the output below:

> source("AnalysisOfScriptData.R", echo=TRUE)

> theData <- read.table("ScriptExampleData1.txt", header=T, sep=",")

> print(summary(theData))
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20

> print(range(theData$Height))
[1] 23.4 38.2

> print(levels(theData$Population))
[1] "A" "B"

This is helpful if you are debugging a script (e.g., figuring out why it is crashing or giving you the wrong answers).
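If you do not have the class data file at hand, the same behavior can be seen with a throwaway script. In this sketch the script is written to a temporary file (a stand-in for a file saved from your editor), and the variable name heights is made up.

```r
# Write a three-line script to a temporary file.
scriptFile <- tempfile(fileext = ".R")
writeLines(c("heights <- c(23.4, 29.7, 38.2)",
             "range(heights)",
             "print(mean(heights))"),
           scriptFile)

# A plain source() shows only what the script explicitly prints,
# so the bare range(heights) line produces no output here.
source(scriptFile)

# With echo = TRUE each command is shown, and visible results
# such as range(heights) are printed as well.
source(scriptFile, echo = TRUE)

# Variables defined in the script are now in the workspace.
print(exists("heights"))
```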

So, in a script, things won't be printed out to the R terminal unless you tell it to. And it is entirely appropriate to ask why you want some things printed out as the script is executing. The variables in a script are available in the main R memory, so if you define a new variable in the script, you will have access to it after the first time you source() it. However, because you can add variables to the main memory of R from a script, I typically erase all variables from memory at the beginning of each script using the command rm(list=ls()). This way it is easy to see that the variable x you are working with is the real one and not another x you had used two hours ago. This is a very important point. Again, we are thinking about the future here, and we need to make sure that the things that we do in our analyses are reproducible at some point in the future. Relying on variables that are outside our script and are only in memory because we did something before running our scripts will lead to frustration (bet on it!).
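To see what clearing the workspace does without actually wiping yours, the sketch below plays the same trick inside a scratch environment. The environment ws and the variable names in it are made up; at the top of a real script you would simply write rm(list=ls()).

```r
# A scratch environment stands in for the workspace so that running
# this sketch does not erase anything of yours.
ws <- new.env()
assign("x", 42, envir = ws)            # left over from "two hours ago"
assign("theData", 1:10, envir = ws)
print(ls(envir = ws))

# Same idea as rm(list = ls()) at the top of a script.
rm(list = ls(envir = ws), envir = ws)
print(ls(envir = ws))                  # nothing left

# A script that now refers to x fails loudly instead of quietly
# reusing a stale value.
print(exists("x", envir = ws, inherits = FALSE))
```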


In Chapter 9 there was a more complete discussion of how you can format your data for printing. As you begin writing scripts right now, just focus on writing the routines that you need to use to get an answer and later you can focus on making it look pretty.

10.3 Adding Comments To Your Code

Speaking of looking pretty, you must add comments to your code so that you remember what is going on inside that file. To comment code in R you put a hash character at the beginning of the section that you want to be commented. This will comment the line from that point to the right. Everything to the left of the hash character is considered code that will be evaluated.

x <- 20 # this comment will let the assignment happen

# this is a comment that spans multiple lines and won't
# be evaluated even if it has logical R code in it
# x <- 21

print(x)

Empty lines are also nice to sprinkle through your scripts so that logical partitions can be identified. The R interpreter ignores all commented material and all lines that do not have anything on them, so you are not penalized for having them there.


10.4 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• # Indicates the start of a comment. The R interpreter ignores everything to the right of this symbol.

• rm(x) This function removes the variable x (or if x is a list of variable names, all of them) from memory.

• source(x) This function causes R to look for the script named x and evaluate its contents from start to finish. This works just as if you had typed in the lines of the script, with the exception of how variables are printed out to the terminal.

• cat(x) This function dumps the contents of x to the GUI output as a single entity.

• print(x) Send the contents of the variable x to the terminal output.

• summary(x) Provides a summary of the variable x.


10.5 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. How do you remove all variables from memory in the current workspace?

2. What happens when you set the optional argument verbose=TRUE when calling source?

3. Are you lazy?

4. Can R evaluate scripts that are written in Word or Excel?

5. How do you change the current working directory in R ?

6. How does the optional argument echo=TRUE change the output of sourcing a script in R?

7. How would you print the summary of a data frame from within a script?

8. What character is used to indicate a comment?

9. How would you comment out several lines of code in a script?

10. Why is it important to comment your code?


Chapter 11

Programming

Programming is the art of getting a computer, which does only and exactly what you tell it to do, to do the things you want it to do. The language that R uses for programming is derived from S (the language behind S-Plus) and will look familiar to anyone who has programmed in another language or seen other programming languages before.

In general, the majority of programming in R will be very linear. While it is possible to program in an object-oriented fashion (and indeed it is not that bad of an implementation in my opinion), I won't be covering that in this book. The programs that I'll help you build will have a start, proceed through a set of operations, have some conditional statements, perhaps some loops, print out some stuff or save it to a file, and then exit. If you have never programmed before, you need to think about programming as a kind of recipe, a very precise one. You need to think about the problem that you are going to solve by writing a program. And then you need to think about the exact steps that you will need to take to accomplish what you are attempting to do.

In this chapter, we will tackle a rather easy problem as a test case to show off how to construct a very simple program. The problem that we are going to deal with is how to measure canopy light from a hemispheric photo. An example of a hemispheric photo is given in Figure 11.1. This photo was taken by S.B. Weiss from the winter roosting habitat of the monarch butterfly in the Monarch Biosphere Reserve, Mexico. In this image, it is easy to see the amount of canopy closure when taken from the hemispherical lens.

What we are going to do in this chapter is determine how much of that image is open sky as a surrogate to measure available light in these forests. In the next few sections, I will show some basic programming tools that we will use to write this program. Then we will walk through the loading of an image and discuss how we get information from and manipulate image data. Finally we will set out to write the program in a step-wise fashion and finish with the completed program.

In this chapter, you will focus on the following topics:

• Be introduced to some basic programming logic and the corresponding R grammar.

• Develop a detailed pseudocode for a given program.

• In a step-wise fashion, develop and test the program.

Figure 11.1: Hemispherical photograph of winter roosting habitat at Monarch Biosphere Reserve, Mexico. Photo by S.B. Weiss, made available under the Creative Commons Attribution 2.5 license.

11.1 Looping

As mentioned in Chapter 2, R is primarily a vector language. The consequence of this is that if you are looking for a language to do fast loops through a data set, R is not it. In fact, Perl or Python would actually be faster for looping-like algorithms. That being said, there are reasons we occasionally need to use loops in R and here is a general overview.

A loop, when referred to in a programming language, is a sequence of statements that is repeated over and over again until some condition is reached. The items inside the loop are typically contained within curly brackets (e.g., { }).
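As a rough illustration of the loop-versus-vector contrast mentioned above, the following sketch adds up the same numbers twice, once with an explicit loop and once with a single vector call. The variable names (values, loopTotal, vectorTotal) are made up for this example.

```r
# A made-up vector of 1000 values to add up
values <- seq( 1, 1000 )

# Loop version: visit each element and accumulate a running total
loopTotal <- 0
for ( v in values ) {
  loopTotal <- loopTotal + v
}

# Vector version: one call that works on the whole vector at once
vectorTotal <- sum( values )

cat( "loop:", loopTotal, " vector:", vectorTotal, "\n" )
```

Both approaches give the same answer; the vector form is simply shorter and is usually the faster of the two in R.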

11.1.1 The While

The while looping construct is a good one to use if you have a particular condition that you want to check over and over again, performing some operations as long as the condition stays in one state. The while loop has the form while( COND ) { <code goes here> }. The COND term in the parentheses is evaluated as a logical statement each time you go through the loop, and the loop will continue as long as COND is TRUE. When COND is FALSE, the loop exits and R starts to evaluate statements after the closing curly bracket. There can be a lot of code between the brackets.

The following example loops as long as x < 10 and prints out the value of x each timethrough the loop.

> x <- 0
> while ( x < 10 ) {
+   x <- x + 1
+   cat( x, " " )
+ }
1 2 3 4 5 6 7 8 9 10

When you start looping here, x = 0, and each time through the loop the variable x is incremented and printed out on the console.
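The condition does not have to be a simple counter. As a small sketch (the variables value and steps are invented for this illustration), the loop below keeps halving a number until it drops below 1 and counts how many times it had to do so.

```r
# Hypothetical example: keep halving a value until it falls below 1
value <- 100
steps <- 0

while ( value >= 1 ) {
  value <- value / 2     # change the thing the condition depends on
  steps <- steps + 1     # keep track of how many passes we made
}

cat( "It took", steps, "halvings to get down to", value, "\n" )
```

Notice that something inside the brackets must eventually make the condition FALSE, otherwise the loop never ends.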

11.1.2 The For

Another common loop is one that focuses on the value of a counting variable (e.g., the index in the loop). What this looping construct does is combine the initialization of the condition variable (a counter), the incrementing of the counter each time through the loop, and the exit from the loop when the counter has run through all of its values. The general form of the for statement is for( COND ) { <code goes here> }. The COND can be one of many different constructs that set up a counting variable. Here are some examples, the last two of which use a vector x.

> for ( i in seq(0,9) ) {
+   cat( i )
+ }
0123456789
> for ( i in 0:9 ) {
+   cat( i )
+ }
0123456789
> x <- seq(0,9)
> for ( i in seq( length(x) ) ) {
+   cat( x[i] )
+ }
0123456789
> for ( i in x ) {
+   cat( i )
+ }
0123456789

For the COND the variable i is used as the counting variable along with the keyword in.
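One caveat with the seq( length(x) ) form is worth a quick sketch. If the vector happens to be empty, seq( length(x) ) counts down from 1 to 0 rather than producing nothing, whereas the base R function seq_along() gives an empty index. The variable names below are made up for the illustration.

```r
# An empty vector, e.g. what is left after a filter matched nothing
empty <- numeric( 0 )

# seq( length(empty) ) is seq(0), which counts DOWN from 1 to 0
riskyIndex <- seq( length( empty ) )

# seq_along() returns a zero-length index instead
safeIndex <- seq_along( empty )

# With the safe index the body of the loop never runs
timesRun <- 0
for ( i in safeIndex ) {
  timesRun <- timesRun + 1
}

cat( "risky index:", riskyIndex, " loop ran", timesRun, "times\n" )
```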

11.2 Conditional Statements

The next tool in your R programming toolbox is the conditional statement. Conditional statements control the flow of logic through a script or program. There are many cases where you would like to run some command or set of commands only if some condition is true. For example,

if( CONDITION ) then RESPONSE

else if( OTHER_CONDITION ) then OTHER_RESPONSE

else FINAL_RESPONSE

Here the logic asks about the state of CONDITION and OTHER_CONDITION. If CONDITION is TRUE then RESPONSE is done and none of the other conditions are evaluated, nor are their responses performed. The R interpreter just skips everything until the end of the set of conditionals. If CONDITION is not TRUE but OTHER_CONDITION is, then the only response to be performed is OTHER_RESPONSE. If neither CONDITION nor OTHER_CONDITION is true then FINAL_RESPONSE is performed. Note that only one response is ever performed each time.
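The outline above is pseudocode; R itself has no then keyword, and each response simply follows its condition. A minimal sketch of the same three-way logic in R syntax is given below, using made-up temperature readings and cut-offs purely for illustration.

```r
# Made-up temperature readings (degrees C) to classify
temps <- c( -5, 12, 31 )
labels <- character( 0 )

for ( temp in temps ) {
  if ( temp < 0 ) {
    label <- "freezing"      # CONDITION was TRUE
  } else if ( temp < 25 ) {
    label <- "mild"          # only checked when the first test failed
  } else {
    label <- "hot"           # neither condition was TRUE
  }
  labels <- c( labels, label )
  cat( temp, "is", label, "\n" )
}
```

Each reading ends up with exactly one label, which mirrors the rule that only one response is ever performed.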

In the example below, I set up a vector of boolean (TRUE|FALSE) variables and then loop through them one at a time to see what they are.

> observations <- as.logical( c(TRUE, FALSE, FALSE, TRUE, TRUE) )
> observations
[1] TRUE FALSE FALSE TRUE TRUE
> for ( obs in observations )
+   print( obs )
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] TRUE

> for ( obs in observations ) {
+   if ( obs == TRUE )
+     cat( obs, "it is true \n" )
+   else
+     cat( "not\n" )
+ }
TRUE it is true
not
not
TRUE it is true
TRUE it is true

We can also use conditional operators as the CONDITION in an if statement. In the example below, we cycle through the numbers 1 through 10 and, for each of them, determine whether it is odd or even using the modulus operator %%. This operator returns the remainder after a division.

> for ( i in 1:10 ) {
+   if ( i %% 2 )
+     cat( i, " is odd\n" )
+   else
+     cat( i, " is even\n" )
+ }
1 is odd
2 is even
3 is odd
4 is even
5 is odd
6 is even
7 is odd
8 is even
9 is odd
10 is even

Each time through, the remainder i %% 2 is evaluated. The possible values are 1 and 0 which, when coerced with as.logical(), turn out to be TRUE and FALSE respectively, so the appropriate message is printed.
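To see the coercion on its own, here is a short sketch that computes the remainders for a few numbers and converts them to logical values. The variable names are invented for the example.

```r
# Remainders after dividing 1 through 6 by 2
remainders <- 1:6 %% 2

# Coerce the remainders: 0 becomes FALSE, 1 becomes TRUE
isOdd <- as.logical( remainders )

# Any non-zero number, not just 1, is treated as TRUE
nonZero <- as.logical( c( 0, 1, 2, -3 ) )

print( remainders )
print( isOdd )
print( nonZero )
```

This is why if( i %% 2 ) works without an explicit comparison such as i %% 2 == 1.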

11.2.1 Bracketing

There is a little bit of bracket magic going on here and I should take the time to make a few comments. Notice in the previous listing, there were brackets surrounding the content inside the for loop. These brackets are essential because there is more than one line of code inside the for loop. If there were only one line (see the earlier code listing where print(obs) is the only code inside the for loop) then the enclosing brackets are optional.

As a general rule, after any conditional (e.g., the if/else if/else) or loop (e.g., while/for), if there is only one line of code then you do not need to use brackets if you do not want to. Examples include:

> if ( rnorm(1) > 0.5 )
+   print( "greater" )
> while ( TRUE )
+   print( "this will last forever" )

This rule is recursive in that the “one line of code” is any line that is not a conditional or a loop. In the next example, I loop through the numbers 1-10 and look for those even numbers that are not divisible by 4 (n.b. I could have used a compound conditional statement such as if( !(i%%2) && (i%%4) ) but that would have really screwed up my example).

> for ( i in 1:10 )
+   if ( !( i %% 2 ) )
+     if ( i %% 4 )
+       cat( "the value=", i, "\n" )
the value= 2
the value= 6
the value= 10

In some sense, you can think of these kinds of “one-liners” as just one-off shortcuts. There is nothing wrong with using brackets even in these cases. In fact, it may open up your code a bit and make it a bit easier to read in the future. You just do not have to use them.

However, where you want more than one statement to be executed after a loop or conditional statement then you must use brackets.
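A short sketch shows what goes wrong when the brackets are forgotten. In the first loop only the first statement is repeated; the second statement runs once, after the loop has finished. The counters used here are made up for the illustration.

```r
# Without brackets only the FIRST statement belongs to the loop
countA <- 0
extraA <- 0
for ( i in 1:3 )
  countA <- countA + 1
extraA <- extraA + 1       # runs once, after the loop is done

# With brackets both statements are repeated each time through
countB <- 0
extraB <- 0
for ( i in 1:3 ) {
  countB <- countB + 1
  extraB <- extraB + 1
}

cat( countA, extraA, countB, extraB, "\n" )
```

Indentation makes no difference to R here; only the brackets decide what is inside the loop.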

11.3 Outlining A Program

The most difficult part of programming is understanding where to start. Writing a program, on the surface, appears to be a daunting task in itself. However, when I write programs I tend to think of them not as a single large program but as a series of smaller steps. The key to doing this is to understand the sequence of steps that we need to accomplish so that the program can do what is required.

So, first things first. State what you want the program to do in specific terms. For this Chapter we will be working on developing a program that calculates the amount of canopy openness from a hemispheric image (Figure 11.1). If you haven’t already done so, I recommend that you look at Chapter 7 to refresh yourself on how we work with the internals of an image.

Next, we need to get out a sheet of paper and write down, exactly, how the program is going to work. It is important that we include all the steps necessary and in the order in which they are to be performed. An example of this would be:

1. Load image into memory

2. Determine what parts of image are “open canopy”

3. Determine total area of image

4. Print out the proportion of canopy that is open.

So, each of these steps is a relatively easy one by itself and we will create the overall program by breaking it up into manageable parts.

11.4 Creating A Program

It is often necessary to build a program incrementally. Using the outline in the previous section, we can open a new file and create a script that does each of these items in succession. Typically, I find it helpful to work on the R command line to test out particular sets of commands, and when I have them exactly the way I like them I move them to a script.

11.4.1 Step 1: Loading An Image Into Memory

In Chapter 7, we examined how to load images into memory, translate them into various formats, and get into their knickers, so to speak. So to begin with, the image as I retrieved it from Wikipedia is a JPEG image. I will begin by turning it into a PPM formatted image as discussed in Chapter 7 using the program GIMP (http://www.gimp.org), although you could use any image manipulation program and there are several free ones available for you on the internets. The PPM file is what you have access to in the class folder for Chapter 11.

Figure 11.2: The blue channel of the canopy picture displayed as a greyscale image.

Figure 11.3: A histogram of values in the blue channel (Figure 11.2).

> library( pixmap )
> img <- read.pnm( file="Hemiphoto monarch habitat1.ppm" )
Read 637563 items
> plot( img )

Now we have the image loaded, along with a plot that is identical to that displayed in Figure 11.1, and we must figure out how it is represented.

11.4.2 Step 2: What Is “Open Canopy”

The variable img has the following components and here we need to figure out what parts of the image are the sky parts.

> names( attributes( img ) )
[1] "size"     "cellres"  "bbox"     "bbcent"   "channels" "red"      "green"
[8] "blue"     "class"

Remembering that there are three different channels in a PPM file, one for red, one for green, and one for blue, perhaps we should look there first. You can plot each of the channels as an image by creating a pixmapGrey() image and see the intensity of each color channel.

> plot( pixmapGrey( img@blue ) )
> plot( pixmapGrey( img@red ) )
> plot( pixmapGrey( img@green ) )
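The @ symbol used above pulls a named component (a slot) out of an object. If you do not have the image file handy, the idea can be tried on a toy object. The sketch below defines a hypothetical class called ToyImage with three matrix slots; it is only a stand-in for illustration and is not how the pixmap library defines its own classes.

```r
library( methods )

# A toy class with one matrix per color channel (hypothetical stand-in)
setClass( "ToyImage",
          slots = c( red="matrix", green="matrix", blue="matrix" ) )

# Build a tiny 2 x 2 "image" with constant values in each channel
toy <- new( "ToyImage",
            red   = matrix( 0.1, nrow=2, ncol=2 ),
            green = matrix( 0.5, nrow=2, ncol=2 ),
            blue  = matrix( 0.9, nrow=2, ncol=2 ) )

# List the slots, then reach into one of them with @
channelNames <- slotNames( toy )
onePixel <- toy@blue[1,2]

print( channelNames )
print( onePixel )
```

Once a channel has been extracted with @ it is an ordinary matrix, so everything you know about indexing rows and columns applies.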

And from this you will see that the different channels look pretty much the same when evaluating the area that is considered the “sky” in this image. For our purposes, we will only use the blue channel as displayed in Figure 11.2.

So if that is the component of the image that we are going to use, we now need to determine which values to look for. To do this, you can easily make a histogram composed of the values in the blue channel of the image using the command hist( img@blue ). We can see from Figure 11.3 that there is a tremendous number of values in this channel at the low end, a peak at around 0.2, and another at the top end close to 1.0.
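If you want the numbers behind such a histogram rather than the picture, hist() can be asked not to draw. The sketch below uses a small synthetic matrix as a stand-in for img@blue (the real image is not needed); the matrix fakeBlue and its values are invented for the illustration.

```r
# Synthetic stand-in for img@blue: mostly dark pixels plus some bright ones
set.seed( 42 )
fakeBlue <- matrix( c( runif( 80, 0, 0.4 ), runif( 20, 0.95, 1 ) ), nrow=10 )

# Bin the values into ten bins between 0 and 1 without drawing anything
h <- hist( fakeBlue, breaks=seq( 0, 1, by=0.1 ), plot=FALSE )

# The number of pixels that fell into each bin
print( h$counts )
```

A pile of counts in the lowest bins and another in the highest bin is the same two-ended pattern described for Figure 11.3.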

We can get a bit more specific with this image and plot the intensity of a particular row of values in the blue channel to double check that values close to 1.0 represent light regions and those near 0.0 represent dark regions. The following commands create the image displayed in Figure 11.4, where the raw values along the 230th row of pixels (indicated by the red dashed line) are shown in blue. It is easy to see that the value in the blue channel gets larger as the dashed line crosses the image.

Figure 11.4: Intensity of blue channel values in the image as taken through a slice of the image(at pixel row 230 as indicated by red dashed line).

> plot( img, axes=T, bty="n", xlab="Image Width", ylab="Image Height" )
> par( new=T )
> abline( 230, 0, col="red", lwd=2, lty=2 )
> par( new=T )
> plot( img@blue[230,], bty="n", type="l", xlab="", ylab="", col="blue",
+       lwd=3, axes=F, ylim=c(-10,10) )

So, at this point, we need to make a value judgement. We are fairly confident that values close to one in the blue channel (and others you can go check yourself) represent areas in the image where it is pretty light. But we need to make a cut-off such that if we look at a pixel, we can put it into the light or not-light category. For the purposes of this exercise, I will assume that values that are ≥ 0.98 are to be considered as sky, and I will also make the restriction that I need the pixels in each channel to meet or exceed this cut-off.

Now, to find out how much of the image is sky (using this definition), we must:

1. Loop through every matrix and the items in each matrix.

2. Evaluate if the value should be considered as sky or not.

3. Use a variable to keep track of all the pixels that meet the criteria.

So to our script, we will add the following lines of code (note that the counter numSky must be set to zero before the loop starts):

> numRows <- img@size[1]
> numCols <- img@size[2]
> numSky <- 0
> for ( row in 1:numRows ) {
+   for ( col in 1:numCols ) {
+     if ( img@red[row,col] >= 0.98 &
+          img@green[row,col] >= 0.98 &
+          img@blue[row,col] >= 0.98 )
+       numSky <- numSky + 1
+   }
+ }
> numSky
[1] 9624

So, in the image across all three color channels, we find a total of 9,624 pixels that can be considered to represent the sky (see footnote 1).
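One way to convince yourself that the nested loops are doing the right thing is to try them on a small made-up example and compare the answer with a single vector expression. The sketch below uses three random 5 x 5 matrices as stand-ins for the color channels and an arbitrary cut-off of 0.5; none of these values come from the real image.

```r
set.seed( 1 )

# Three small made-up channels standing in for img@red, img@green, img@blue
red   <- matrix( runif( 25 ), nrow=5 )
green <- matrix( runif( 25 ), nrow=5 )
blue  <- matrix( runif( 25 ), nrow=5 )
cutoff <- 0.5

# Loop version, with the counter explicitly started at zero
loopCount <- 0
for ( row in 1:nrow( red ) ) {
  for ( col in 1:ncol( red ) ) {
    if ( red[row,col] >= cutoff &&
         green[row,col] >= cutoff &&
         blue[row,col] >= cutoff )
      loopCount <- loopCount + 1
  }
}

# Vector version: each TRUE counts as 1 when summed
vectorCount <- sum( red >= cutoff & green >= cutoff & blue >= cutoff )

cat( "loop:", loopCount, " vector:", vectorCount, "\n" )
```

The two counts should always agree, whatever the matrices and cut-off happen to be.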

11.4.3 Step 3: Determine The Total Area Of The Image

OK, finally we are almost finished. We now need to determine the total number of pixels in the image so that we can get a standardized percent of open canopy. We could use the total number of pixels, 461^2 = 212,521, but the image taken with the fish-eye lens is not square; rather, it is a circle that fits in a square whose side has 461 pixels. So, we need to figure out the area of this circle as:

> r <- 461/2
> totalArea <- pi * r^2
> totalArea
[1] 166913.6
> (461^2 - totalArea)/totalArea
[1] 0.2732395

As a side note, the last expression in the code listing shows the proportion by which we would bias our estimate if we just used the total number of pixels in the image; 27.3% is a reasonably sized bias!
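That 27.3% is not specific to this photo. For a circle inscribed in a square of side s, the ratio works out to (s^2 - pi (s/2)^2) / (pi (s/2)^2) = 4/pi - 1, whatever the value of s. A quick sketch with a few made-up image sizes confirms this.

```r
# A few image widths in pixels (the last two are made up)
sides <- c( 461, 1000, 2048 )

# Area of the square and of the circle that just fits inside it
squareArea <- sides^2
circleArea <- pi * ( sides / 2 )^2

# Relative overestimate from using the square instead of the circle
bias <- ( squareArea - circleArea ) / circleArea

print( bias )
print( 4 / pi - 1 )
```

So any fisheye image treated as a full square would overstate the total area by the same proportion.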

11.4.4 Step 4: Print Out The Proportion Of Canopy That Is Sky

This part is fairly easy and doesn’t require much.

> numSky / totalArea
[1] 0.05765857

Footnote 1: While this part of the exercise was excellent at showing some of the programming paradigms and how they can be combined to give an answer, it is also true that Step 2 can be accomplished in R using the one-liner sum( img@blue >= 0.98 & img@green >= 0.98 & img@red >= 0.98 ). Here the three conditionals combine into a matrix of logical values, which the function sum() coerces into integers. While it would have been much shorter to do it this way, it would have negated all the quality teaching experiences that I was laying on you...

11.4.5 The Complete Program

The complete program is listed below with comments. There are a few changes in the program that I made to make it a bit easier to work with. Comments should be self explanatory and are indicated by lines that start with the hash character (#).

# removes all variables from memory at start of script
rm( list = ls() )

# load the pixmap library to open the image
library( pixmap )

# I put the file name into a variable so
# it could be changed easily at the top
# of the file if necessary
fileName = "Hemiphoto monarch habitat1.ppm"

# I also put the criteria into a variable
# so we can change it in one place to see
# how the results differ
skyCriteria <- 0.98

# Read in the image and find the number of
# rows and columns in it
img <- read.pnm( file=fileName )
numRows <- img@size[1]
numCols <- img@size[2]

# Start the count of sky pixels at zero
numSky <- 0

# Loop through each row
for ( row in 1:numRows ) {

  # Loop through each column
  for ( col in 1:numCols ) {

    # Evaluate the cell in each channel for
    # 'sky criteria'
    if ( img@red[row,col] >= skyCriteria &
         img@green[row,col] >= skyCriteria &
         img@blue[row,col] >= skyCriteria )
      numSky <- numSky + 1
  }
}

# Find total area of fisheye circle
r <- numRows/2
totalArea <- pi * r^2

# Print out the percent canopy open
percentCanopyOpen = numSky/totalArea
cat( "Canopy Opening: ", percentCanopyOpen, "\n" )

11.5 Synopsis

This has been a very simple little program that we made. Despite being simplistic, it does show you how to go about creating a simple analysis program. R is not really a general-purpose programming language and you are unlikely to make large programs with it. The key to R is knowing how to get something put together, take it a step at a time, and break the components into reasonably sized, easy to accomplish pieces. This is where you start.

In Chapter 12 we will build upon what has been done here when we discuss Functions. We can encapsulate code into functions and make our lives much easier. For now, play around with the program and the exercises and get comfortable with typing code.

11.6 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• x %% y The modulus operator. This returns the remainder of the division x/y.

• as.logical(x) Coerces x into a logical variable if possible. See 2.4.7 for more information on logical variables.

• rm(x) This function removes the variable x from memory.

• abline(a,b) This function plots a line with an intercept of a and a slope of b in the current graphics window.

• for(INDEX in SEQUENCE) A main looping construct that uses the counter INDEX, which takes each value contained in SEQUENCE in turn.

• while(COND) A looping construct that continues to loop as long as COND is TRUE and exits as soon as COND is FALSE.

• if(COND) The evaluation of the condition COND. If it is TRUE then the next line following the if statement is executed. If it is FALSE then the next line is skipped. You can include several lines to be evaluated after this and other evaluation statements by enclosing the code in curly brackets { }.

• else if(OTHER_COND) The second evaluation of a condition. This must not be the first conditional (e.g., there is an else here that implies a previous if or else if statement that this is following).

• else The last of a set of conditionals. If none of the previous ones turned out to be true, then whatever follows the else will be evaluated. It is not necessary to have one of these at the end; you may not want to do anything unless some specific conditions occur.

11.7 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Write a short program that lists all numbers from 1 to 100 and determines if they are divisible by 2 and 3.

2. Write a program using the for() loop that prints the numbers from 42 down to 27, one on each line.

3. List some of the assumptions that are included in how the variable numSky is determined.

4. Using the program we created in this Chapter, make a graph of percent canopy with different cut-off values. In your opinion what would be the most biologically meaningful cutoff?

5. Change the program to use a cutoff value based on the sum of the individual color channel values rather than the current requirement that they all be simultaneously over some threshold.

6. How many lines of output do you expect to get from the following code? HINT: Think before you try to run this program.

while ( 1 ) cat( "All work and no play makes Dr. D a dull boy.\n" );

7. Create an outline of the steps that would find the number of values in a matrix that are equal to or greater than 20.

8. Implement the program you outlined using the matrix M <- matrix( runif(25,10,30), nrow=5 ) as your input. Make sure to comment your code appropriately.

9. What is the proper syntax for a condition passed to an if statement that requires x to be greater than 23 and y to be equal to or less than 4?

10. How many else statements can you have after an if statement?

Chapter 12

Functions

Throughout this book, we’ve used both built-in functions such as sqrt() and sum() as well as some that are located in external libraries that we had to load (such as skewness() in 4.3.1 and read.pnm() in 7.2). These functions have been really helpful in making your scripts look clean and readable and have made your life rather easy as you performed some basic statistical analysis. Think what a pain it would have been if you had to write code every time you wanted to calculate the sqrt() of a number... (I’m not even sure how it is done).

Writing your own functions in R is a very useful way to save a lot of typing. You can consider a function a small self-contained bundle of instructions that you can call whenever you need to. Say you are picky about the way your graphics look, or that there is a particular set of routines that you use to make translations of your data from one format to another. Putting this code into a function and putting that function in a location where you can get access to it whenever you need it is a real treat.

In this Chapter you will learn the following skills:

• Learn the syntax required to write your own functions.

• Understand the scope of a variable and why you should care.

• Create a basic library of routines that you can use in the future.

12.1 Function Syntax

The format of a function basically has the following three parts:

1. The name of the function. The creation of a name for a function is just as important as for a variable. I find it helpful to try to make the name tell me what the function does (I’m funny that way), which means it typically starts with a verb such as convertMissingData(), removeLameExcuses(), or makeTheGraphTheWayILikeIt().

2. The assignment to the name. Right after the name you will have the assignment of the generic function() function to the variable (see the syntax below). This tells R that the name is not a variable but will actually be the name of a function.

3. The function contents. This is the part that you get to write. Here is where you put all the stuff together to do whatever it needs to do.

In general, these three parts are put together to look like:

doMyBidding <- function() {
  # Function Contents
}

Now this is a fairly boring function; it takes no arguments and doesn’t return anything to you. It is kind of like saying, “R go to your special place and do something but don’t tell me what it is.” As you write functions, they will be considerably more complicated (and hopefully useful).

In this Chapter I will post the raw code for the function itself followed by the output of R from the command line. The straight posting of the function syntax allows you to cut-and-paste it into the R interpreter (even though you will learn it better by typing it).

Also, functions that you have defined are available in the local memory of the interpreter in the same way as local variables are. If you use the ls() command to list the items in memory it will show your function names alongside your variable names.

> ls()
[1] "doMyBidding" "x"

12.1.1 Returning Values From A Function

Most likely you are calling some function because you are interested in getting a response from it. It is not common to write functions that do not give you something back in return.

To return a value from a function, R has you put the name of the variable on the last line of the function. An example of this is the following function that returns a single number.

gimmeANumber <- function() {
  42
}

> gimmeANumber()
[1] 42
> gimmeANumber()
[1] 42

And a slightly better function here that actually returns a random number:

gimmeAnotherNumber <- function() {
  x <- runif(1,1,100)
  x
}

> gimmeAnotherNumber()
[1] 87.3278
> gimmeAnotherNumber()
[1] 64.97312
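The last line does not have to be a bare variable name; whatever expression is evaluated last is what comes back. As a small sketch, the hypothetical function standardError() below (not something defined elsewhere in this book) returns the result of a calculation directly.

```r
# Hypothetical helper: standard error of the mean of a numeric vector
standardError <- function( x ) {
  n <- length( x )
  sd( x ) / sqrt( n )     # evaluated last, so this is what is returned
}

# Try it on a few made-up measurements
measurements <- c( 2, 4, 4, 4, 5, 5, 7, 9 )
se <- standardError( measurements )
print( se )
```

Here n exists only while the function is running; the caller sees just the returned value.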

You can also use return() to exit the function and potentially return a value. Here is an example that checks to see if the passed argument is the right kind. If it is not, it prints an error and returns; otherwise it performs a calculation and then returns the result.

gimmeHalf <- function( theValue ) {

  # check to see if it is a numeric value
  # if it is then return half
  if ( is.numeric( theValue ) ) {
    return( theValue / 2.0 )
  }

  # if it isn't then complain
  else {
    cat( "The value", theValue, "is not a number, try again.\n" )
    return()
  }
}

> gimmeHalf( 12 )
[1] 6
> gimmeHalf( "Hello partner!" )
The value Hello partner! is not a number, try again.
NULL

Notice here that when the function left the else section by calling return() without any arguments, the function actually returned the NULL value. If you are not interested in seeing NULL, something that signals to you that the value passed to the function may be incorrect, then you can remove the last return() statement so that nothing is printed after the message (strictly speaking, the function still hands back an invisible NULL, which is the value of cat()). Here is what that function would look like.

gimmeHalf <- function( theValue ) {

  # check to see if it is a numeric value
  # if it is then return half
  if ( is.numeric( theValue ) ) {
    return( theValue / 2.0 )
  }

  # if it isn't then complain
  else {
    cat( "The value", theValue, "is not a number, try again.\n" )
  }
}

> gimmeHalf( 14 )
[1] 7
> gimmeHalf( "bob" )
The value bob is not a number, try again.
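If you capture the result in a variable you can check what actually came back in each case. The sketch below repeats the same logic under a different name, halfOrComplain(), chosen here only so that it does not overwrite your own copy of gimmeHalf().

```r
# Same idea as the function above, under a made-up name
halfOrComplain <- function( theValue ) {
  if ( is.numeric( theValue ) ) {
    return( theValue / 2.0 )
  } else {
    cat( "The value", theValue, "is not a number, try again.\n" )
  }
}

# Store what comes back instead of letting R print it
good <- halfOrComplain( 14 )
bad  <- halfOrComplain( "bob" )

# The complaint case quietly hands back NULL
print( good )
print( is.null( bad ) )
```

Testing the result with is.null() is one simple way for a calling script to find out that something went wrong.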

Vector Arguments

By default your function above can work on vectors of values just as easily as on single numbers. This is because a vector of numbers will return TRUE when passed to is.numeric() (see 2.4.8 for more on this). Here is an example,

> x <- seq(2,20,by=3)
> is.numeric( x )
[1] TRUE
> is.vector( x )
[1] TRUE
> x
[1]  2  5  8 11 14 17 20
> gimmeHalf( x )
[1]  1.0  2.5  4.0  5.5  7.0  8.5 10.0

So by default, you can work with vectors of your values just as easily as with single numbers. This is pretty cool and you should try to remember the love that R has for vector operations, because it is much faster to call your gimmeHalf() function by passing it a vector of values than to use a loop to go through the vector and call gimmeHalf() for each individual value...
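The two routes can be put side by side to confirm that they give the same answer. In this sketch halve() is a stripped-down, made-up stand-in for gimmeHalf() so that the example is complete on its own.

```r
# Minimal stand-in for gimmeHalf(), without the type checking
halve <- function( theValue ) {
  theValue / 2
}

x <- seq( 2, 20, by=3 )

# One call on the whole vector
allAtOnce <- halve( x )

# The long way: one call per element, filling in a results vector
oneByOne <- numeric( length( x ) )
for ( i in seq_along( x ) ) {
  oneByOne[i] <- halve( x[i] )
}

print( identical( allAtOnce, oneByOne ) )
```

The single call is both less typing and less work for R, and the difference grows with the length of the vector.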

Here is a slightly longer example of a function. Notice that inside the function, I have added some comments. This is a very good idea because it allows you to document what you are doing inside the function. In fact, I typically write functions by:

1. Writing the signature of the function, the funcName <- function() part.

2. Using comments, writing the sequence of events that have to occur inside the function so I can see what needs to be done (breaking large problems into small ones here).

3. Filling in the code to allow R to do my bidding.

So let’s walk through these steps and make a function. The purpose of this function is to get a little encouragement for my programming endeavors by having R return some nice praise for me.

Step 1: Create signature The signature for this function will be:

giveMeSomeMomLove <- function()

Step 2: Using comments create logic of function: The overall goal of this function is to return a random statement from my mother, so I will have to set up some statements, find a random one, and then return it.

giveMeSomeMomLove <- function() {
  # set up a vector of loving mother sayings

  # pick a random number to use as index for responses

  # If you put the vector name and the index on the last line, it is returned
}

Step 3: Fill in the R logic: Now that I have the comments set out, it is fairly easy for me to use them as a guide in laying out the logic of the function. You do not have to document every line of code in your functions, but if you put in enough so that it is obvious what is going to happen next, you will find yourself being happy with your past self more often than hating what you had forgotten to do.

giveMeSomeMomLove <- function() {
  # set up a vector of loving mother sayings
  momSayings <- c( "Honey, your dad and I think you are doing just fine.",
                   "Come home this weekend, I made your favorite dessert.",
                   "We think you are the BEST student at VCU.",
                   "You know I took calculus back in college, maybe I can help.",
                   "I just know you'll be able to find a good job after college." )

  # pick a random number to use as index for responses
  resp = round( runif( 1, 1, length( momSayings ) ) )

  # If you put the vector name and the index on the last line, it is returned
  momSayings[ resp ]
}

> giveMeSomeMomLove()
[1] "We think you are the BEST student at VCU."
> giveMeSomeMomLove()
[1] "Honey, your dad and I think you are doing just fine."
> giveMeSomeMomLove()
[1] "You know I took calculus back in college, maybe I can help."

Feel free to add some of your own mother sayings here.
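One detail of the index calculation is worth knowing. Rounding a uniform number between 1 and 5 picks the first and last sayings only about half as often as the middle ones, because only half a unit of the range rounds to each end. If you would like every saying to be equally likely, the base R function sample() is one alternative; the sketch below uses placeholder sayings and an arbitrary seed.

```r
# Placeholder sayings standing in for the real vector
sayings <- c( "saying one", "saying two", "saying three",
              "saying four", "saying five" )

set.seed( 2024 )

# Pick one index, each equally likely
index <- sample( length( sayings ), 1 )
print( sayings[ index ] )

# Tally a large number of draws to see how even the choices are
draws <- sample( length( sayings ), 5000, replace=TRUE )
print( table( draws ) )
```

For a function whose only job is encouragement the difference hardly matters, but it does matter when random choices feed into an analysis.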

12.1.2 Passing Values To A Function

The most common way you will interact with a function is probably by giving it some variables and expecting to get something back.

getIdentityMatrix <- function( numRows ) {

  # make a square matrix with all zeros
  I <- matrix( 0, nrow=numRows, ncol=numRows )

  # make the diagonal all ones
  diag( I ) <- 1

  # return it to the caller
  I
}

> getIdentityMatrix( 2 )
     [,1] [,2]
[1,]    1    0
[2,]    0    1
> getIdentityMatrix( 5 )
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1

Default Values

Functions can have default values associated with variables that are passed to them. We’ve seen this many times so far as you’ve looked up and seen the function signatures of built-in functions. This is a very convenient feature for you and your users. In general, when you think of writing functions you should not try to make them so specific that you have a lot of different functions that do almost the same thing. Rather, you should make them robust, and if you can combine a few functions into a single one whose values change depending upon a parameter you pass to it, that is better overall form. For example, the function getIdentityMatrix() returns a square matrix with ones down the diagonal. This matrix is a pretty special one in matrix analysis and probably should have its own function just because of its status. However, there are a number of reasons why you may need a square matrix with a single value down the diagonal, and perhaps it would be more robust to create a function such as:

getDiagonalMatrix <- function( size, value=1 ) {
  theMat <- matrix( 0, nrow=size, ncol=size )
  diag( theMat ) <- value
  theMat
}

> getDiagonalMatrix( 3 )
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
> getDiagonalMatrix( 3, 42 )
     [,1] [,2] [,3]
[1,]   42    0    0
[2,]    0   42    0
[3,]    0    0   42

Now this function has a default value to set the diagonal values to (e.g., 1), producing the Identity matrix I by default; however, it can also produce any diagonal matrix when you pass an additional parameter to the function. If you do not pass it to the function, it is assigned in the signature for you by default. This makes the function perhaps more robust and useful. Of course, this is all up to you; you are the programmer here and you get to make the decisions. After all, there are several different ways to get the correct result when programming and, as Biologists, we should focus on the biology and use tools like R as simple tools.
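Arguments can also be passed by name, in which case their order does not matter. The sketch below repeats the same logic under the made-up name makeDiagonal() so that it is complete on its own, and compares the result with the base R function diag(), which can build the same matrix directly.

```r
# Same logic as above, under a made-up name
makeDiagonal <- function( size, value=1 ) {
  theMat <- matrix( 0, nrow=size, ncol=size )
  diag( theMat ) <- value
  theMat
}

# Leaving out 'value' uses the default of 1
identity2 <- makeDiagonal( 2 )

# Named arguments can be given in any order
sevens <- makeDiagonal( value=7, size=2 )

# Base R equivalent: a 2 x 2 matrix with 7 down the diagonal
builtIn <- diag( 7, 2 )

print( sevens )
```

Naming arguments in the call also makes scripts easier to read later, since the reader does not need to remember the order in the signature.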

12.2 Scope

The scope of a variable determines the value that it has depending upon where it is located. This topic is a pretty important one and can be a bit tricky at times.

myFunc <- function(x) {
  x <- 42
  cat("x inside function is", x, "\n")
}
> x <- 21
> x
[1] 21
> myFunc(x)
x inside function is 42
> x
[1] 21

myFunc <- function(a) {
  x <- 42
  cat("other x inside function is", x, "\n")
}
> x <- 23
> myFunc(x)
other x inside function is 42
> x
[1] 23
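In both examples the assignment inside myFunc() creates a variable that is local to the function, so the x in the workspace is never changed. The sketch below restates that and adds one case not shown above: when a function uses a name it has not defined locally, R looks the name up in the environment where the function was created. The function names setLocal() and readOuter() are made up for illustration.

```r
x <- 21

# Assigning to x inside the function creates a local x;
# the x in the workspace is untouched.
setLocal <- function() {
  x <- 42
  x
}

# There is no local x here, so R finds the x defined in the workspace.
readOuter <- function() {
  x + 1
}

insideValue <- setLocal()
outerValue  <- readOuter()
cat("local x was", insideValue, "; workspace x is still", x, "\n")
cat("readOuter() saw the workspace x and returned", outerValue, "\n")
```

Relying on workspace variables inside a function works, but passing values in as arguments usually makes a function easier to reuse and to debug.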

12.3 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• function(args) code  Creates a function whose body is code and which takes the arguments args.

• return(x)  Returns the value x from the function, which means the function is exited immediately and no more of its code is executed.
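A small sketch of both items together, using a hypothetical helper called signLabel(). It shows return() ending a function early, and also that the last expression evaluated in a function is returned when return() is never reached.

```r
# Hypothetical helper: label a single number by its sign.
signLabel <- function(x) {
  if (x < 0) {
    return("negative")   # the function exits here; nothing below runs
  }
  if (x == 0) {
    return("zero")
  }
  "positive"             # the last evaluated expression is returned
}

labels <- c(signLabel(-3), signLabel(0), signLabel(5))
print(labels)
```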

12.4 Exercises

The following exercises are meant to help you understand the items presented in this chapter.

1. Create a function that allows you to pass it a regression model and that returns a string containing the formula for the model as you would like to have it displayed on a graph.

2. Create a function that takes a single vector of values and creates a histogram and density line from that data in a new graphics window.

3. Explain scope and how it pertains to the values assigned to variables.

4. Create a function that takes an ANOVA or regression model and saves the ANOVA table to a file. You should probably allow the user to pass a file name to the function.

5. How do you set default values for a function when you write it?

6. Explain how you get your functions to accept vector arguments.

7. Create a function that returns random numbers but allows the user to set an optional argument so that only even numbers are returned.

8. How would you remove a function from the memory of R?

9. Let's assume that you have a folder full of data files named Data1, Data2, Data3, ..., Data40. Write a function that creates these file names dynamically. You will want to allow the user to specify the base name of the files (e.g., Data) as well as the starting and ending numbers (e.g., 1 and 40), but set the starting number to default to 0.

10. How do you make sure that the arguments passed to your functions are the right kind of variables? For example, what if I passed the variable x <- "this is the end" to a function that expects a number?

Appendix A

Answers to Exercises

In this section you will find answers to the odd-numbered exercises presented in each chapter. These answers are meant to help you get started on the exercises and to facilitate your completion of the remaining questions. I recommend that you look at an answer only after you have completed the exercise, just to make sure that what you thought you were doing is the correct thing. Don't look ahead....

Answers to Chapter 2.

Appendix B

Installing Additional Libraries

The R statistical computing environment is made more robust by the addition of external libraries. Libraries can be written in R, C, or FORTRAN by you or by other people who want to expand the functionality and utility of R.

B.1 Library Availability

There is a list of available libraries at http://cran.r-project.org. As of this writing, there are 1621 different packages in the repository. All are available for you to install and use at your discretion. Each should also come with documentation covering all the functions included in the library, descriptions of its data sets, and some overall discussion of the library.

B.2 Installing Libraries

B.2.1 Using install.packages() As A GUI

The easiest way for you to install a library is to do so from within R itself. To do this, your machine must be connected to the internet. R knows how to find, download, and install binary versions of packages using a Tcl/Tk GUI.

If you conduct the installation as a normal user who does not have administrative privileges on your computer, the libraries will be installed in a location inside your own home directory. Where exactly depends upon which operating system you have. The main thing to keep in mind is that libraries installed into your own directory are available only to you and not to any other users on that machine. If two people use the same machine, each will have to install the library, once in each home directory. Conversely, if you have administrative privileges on the machine you are using, you can install the libraries into a location that everyone who uses that machine can access.
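If you are unsure where libraries will end up on your machine, a quick check along the following lines may help. This is only a sketch: the directories reported depend on your operating system and on how R was installed, and the variable names are made up for illustration.

```r
# Directories R searches for installed libraries, in order.
libDirs <- .libPaths()
print(libDirs)

# The per-user library location, if one is configured
# (an empty string means the variable is not set).
userLib <- Sys.getenv("R_LIBS_USER")
print(userLib)

# Which of the library directories can the current user write to?
# file.access() returns 0 when the requested access is allowed.
writable <- file.access(libDirs, mode = 2) == 0
print(data.frame(directory = libDirs, writable = writable))
```

A directory listed as not writable is typically a system-wide location that requires administrative privileges to install into.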

To start the installation process, issue the command:

> install.packages()

This will bring up a window (drawn with Tcl/Tk, so it won't look quite like a normal window on your operating system) that allows you to select which mirror you would like to use for downloading. An example window is shown in Figure B.1. In general, you should select a location that is geographically close to you. All of these mirror servers are kept up-to-date pretty well and you shouldn't find any differences among the packages on any of them.

Once you have selected your preferred mirror server, another window will be presented (resembling Figure B.2) that lists all the packages that are available to be installed. Be careful here: this simple interface does not check which packages you already have installed; it only lists all the packages that are at your disposal. So just because a package is on that list doesn't mean that you do not already have it installed on your machine.

Select the package, or packages, that you want to install from the list. To select more than one, click on more than one. To deselect a package, click on it a second time. Once you hit the OK button on this window, the install.packages() function will look to see what dependencies the selected packages have (e.g., PackageA requires PackageB but you didn't know that and didn't select it). The selected packages and their dependencies will be downloaded and installed in the correct location. After they are installed, you should be able to use them immediately (i.e., without restarting R).

B.2.2 Using install.packages() For Specific Libraries

If you know the name of the package that you are interested in installing, you can use the install.packages() function directly by passing it a name, or a vector of names, of the packages you are interested in. This will skip the package selection window shown in Figure B.2. The syntax for this would be:

> install.packages("theNameOfTheLibraryNeeded")
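When several packages are needed, one possible convenience is a small wrapper that installs only those not already present. The sketch below is an illustration, not part of R: installIfMissing() is a hypothetical helper built on the real functions requireNamespace() and install.packages().

```r
# Hypothetical helper (not part of R): install only the packages
# that cannot already be loaded.
installIfMissing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  needed <- pkgs[!installed]
  if (length(needed) > 0) {
    # install.packages() accepts a character vector of package names
    install.packages(needed)
  }
  invisible(needed)
}

# 'stats' and 'utils' ship with R, so nothing is downloaded here.
stillNeeded <- installIfMissing(c("stats", "utils"))
print(length(stillNeeded))
```

In real use you would pass the names of the packages you actually want, and you may still be asked to choose a mirror the first time a download is required.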

Libraries have also been partitioned into different Task Views. These are collections that group several different packages under a particular theme. Below is a list of the views that were available as of January 2009 (these categories and descriptions are lifted directly from the website).

Bayesian: Bayesian Inference
ChemPhys: Chemometrics and Computational Physics
Cluster: Cluster Analysis & Finite Mixture Models
Distributions: Probability Distributions
Econometrics: Computational Econometrics
Environmetrics: Analysis of Ecological and Environmental Data
ExperimentalDesign: Design of Experiments (DoE) & Analysis of Experimental Data
Finance: Empirical Finance
Genetics: Statistical Genetics
Graphics: Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
gR: gRaphical Models in R
MachineLearning: Machine Learning & Statistical Learning
Multivariate: Multivariate Statistics
NaturalLanguageProcessing: Natural Language Processing
Optimization: Optimization and Mathematical Programming
Pharmacokinetics: Analysis of Pharmacokinetic Data
Psychometrics: Psychometric Models and Methods
Robust: Robust Statistical Methods
SocialSciences: Statistics for the Social Sciences
Spatial: Analysis of Spatial Data
Survival: Survival Analysis
TimeSeries: Time Series Analysis

Figure B.1: Example of CRAN mirror window as viewed on Linux

Figure B.2: All packages that can be installed from the selected mirror server on my machine.

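You can install all the libraries in a particular view, but install.packages() does not recognize view names on its own. The ctv package provides the install.views() function for this purpose:

> install.packages("ctv")
> library(ctv)
> install.views("ViewName")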

You will still have to specify the mirror server to use and, once you do, R will take it from there. This could be a lengthy process as it may require numerous packages to be downloaded and installed. Be patient.

B.2.3 From the Command Line

Finally, there is one other method that I typically use on my machines, because I usually download the source packages rather than the pre-compiled binaries. However, this method also works with binaries. You can download the package from the CRAN site directly, open a command-line terminal, and change to the directory where the package is located. From there issue the command:

R CMD INSTALL ThePackageYouDownloaded.tar.gz

and R will install it for you. If you do this as the root or administrator user, the package will be installed in a globally accessible location so any user on that machine will have access to it.
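A downloaded source package can also be installed without leaving R, by handing the file to install.packages(). The sketch below assumes the file sits in the current working directory; installFromFile() is a hypothetical wrapper, and the file name in the usage comment is just the placeholder from the command above.

```r
# Hypothetical wrapper around install.packages() for a local source file.
installFromFile <- function(path) {
  if (!file.exists(path)) {
    stop("no such file: ", path)
  }
  # repos = NULL tells R to install from the file rather than a mirror;
  # type = "source" says the file is a source package.
  install.packages(path, repos = NULL, type = "source")
}

# Usage (placeholder file name, not run here):
# installFromFile("ThePackageYouDownloaded.tar.gz")
```

Installing from source may require compilers on your machine if the package contains C or FORTRAN code.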

Bibliography

Caswell, H. (2001). Matrix Population Models: Construction, Analysis, and Interpretation. Sinauer Associates, Sunderland, Mass., 2nd edition.

Index

class, 115
clustal file, 153
coercion, 9
comment character (#), 172

data types, 8
  character, 10; complex, 11; constant, 11; data frame, 18, 26; factors, 16; integer, 8; list, 17; logical, 13; matrix, 14; NULL, 31; numeric, 9; raw, 12; vector, 13
distributions
  dchisq, 43, 68; df, 43, 68; dnorm, 43; pchisq, 43, 54; pf, 43; pnorm, 43; qchisq, 43, 44, 54; qf, 43, 46; qnorm, 43; qt, 46; rchisq, 43; rf, 43; rnorm, 43, 54, 58, 68; rpois, 63; runif, 65, 107

fasta file, 152
figure
  axis labels, 56; title, 56
functions, 6
  %%, 186; abline, 186; any, 150, 161; as.factor, 72, 86; as.index, 186; as.matrix, 86, 144; as.matrix(), 121; attributes, 33; barf, 123; barplot, 137; binom.test, 73; c, 86; cat, 118, 161, 172; cbind, 29, 40, 86, 107; class, 20, 33; colnames, 86, 157; components, 6; cov, 64; density, 57, 58; det, 128; diag(), 126; dim, 123; dist.dna, 154; eigen, 132; else, 186; else if, 186; expression, 161; for, 186; format, 161; function, 195; gimeMeSomeMomLove, 192; ginv, 128; grep, 150; grey, 118; gsub, 150; image, 118; index, 127, 186; kurtosis, 60; length, 20, 86; levels, 17; lm, 128; load, 32, 40; log, 7; ls, 32; matrix, 121, 145; max, 61, 118; mean, 58, 129; merge, 39, 40; min, 61; names, 33; nchar, 148, 161; nj, 154; par, 49; paste, 11, 20, 95, 149; plot, 47, 155; print, 172; q, 31; qchisq, 45; range, 50, 61, 81, 86; rbind, 28, 40; read.dna, 153; read.table, 27, 28, 86, 145; read.table(), 121; rep, 14, 20; return, 191, 195; rexp, 54; rm, 32, 40, 172, 186; rnorm, 54, 56, 118; round, 107; row.names, 33; rownames, 86, 157; rpois, 52, 54; save, 32, 40; sd, 58; seq, 14, 20; skewness, 59; source, 172; strsplit, 148; sub, 150; subset, 35, 36, 40; substring, 149, 161; sum, 127; summary, 17, 20, 86, 93, 107, 172; t, 128; table, 17, 68, 72; unlist, 148, 162; var, 58; while, 186

genetic distance, 154
graphics
  abline, 94, 107; barplot, 137, 144; bg, 48; bmp, 51; bty, 48; bxp, 85; cairo_pdf, 51; cex, 48; col, 48; density plot, 57; dev.copy, 52, 53; dev.off, 52, 53; fg, 48; hist, 52, 55; jpeg, 51, 53; legend, 142, 145; line plot, 47; lty, 48; lwd, 48; main, 48; mfrow, 48, 61; optional parameters, 48; overlaid, 49; par, 48; pch, 48, 107; pdf, 51; pictex, 51; plot, 46, 68, 85; png, 51; postscript, 51; quartz, 51; rug, 104; scatter plot, 46, 47; sub, 48; text, 107; tiff, 51; topo.colors, 52; type, 48; x11, 51; xlab, 48; xlim, 49; ylab, 48; ylim, 49

matrix
  %*%, 144; addition, 123; det, 144; diag, 144; diagonal, 126; dim, 144; eigen, 145; element-wise multiplication, 124; ginv, 145; Hadamard product, 124; multiplication, 124; scalar addition, 123; scalar multiplication, 124; scalar subtraction, 123; Schur product, 124; subtraction, 123; t, 145; trace, 127

Neighbor Joining, 154

operator
  assignment, 18; logical, 19; numerical, 18
operator order, 18

Pinaceae, 153

stats
  anova, 93, 107; aov, 107; binom.test, 86; chisq.test, 72, 76, 86; cor.test, 67, 79, 86; interaction formula, 99; Kruskal-Wallis Test, 82, 83; kruskal.test, 86; lm, 92, 107; Mann-Whitney, 80; mean, 58, 68, 81, 86; median, 63; nj, 162; no intercept, 100; quantile, 63; sd, 68; step, 107; t.test, 107; TukeyHSD, 107; var, 58, 68; Wilcoxon, 80; Wilcoxon Test, 81

variable, 7