
R Bootcamp

Sunday, September 23, 2007

UCLA Department of Statistics

Outline

• 9-10a Breakfast
• 10-1 R Basics, Volume 1
• 1-2:15 Lunch
• 2:15-5 R Basics, Volume 2, and more cool stuff.

We will be emphasizing what you need to know to do your first few assignments. We will revisit the more advanced stuff (mentioned briefly here) in Statistics 202A.

Credit

The data and much of the code were originally produced or collected by Mark Hansen, and were altered for use in this year’s program.

A good amount of material and topics were added and deleted as appropriate for this year’s Bootcamp.

What is R?

• “R is a language and environment for statistical computing and graphics.” It was created by Ross Ihaka and Robert Gentleman, and is closely related to the S language developed by John Chambers and colleagues at Bell Laboratories.

• “R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.”


What is Statistical Computing?

“Computing is the vehicle by which we communicate statistical ideas to our collaborators, to the world at large.”
- Mark Hansen

Why R?

• It can be installed on Mac, Windows and Linux (SAS cannot run on Mac).
• Simple, yet elegant graphics (…with practice)
• Its command-line model is very flexible and very extensible.
• It is a huge project with a lot of contributed packages.
• It can interface with other open-source and commercial software.
• All of this is FREE.

Why R?

• It is widely used by Statistics departments all over the world, and usage is growing daily.
• Other universities (UC Santa Barbara and the university in South Central LA) use S+, which is similar to R, but is not free.
• Most S code can be run in R and vice versa.
• It can interface with faster languages like C and FORTRAN, and faster tools like MATLAB.
• Oh yeah, and industry is beginning to recognize it as a required skill (read “must know MATLAB, S or R”)
• Again, it is FREE

An Analogy

“R is to S as Octave is to MATLAB.”


How to get R at Home

• Visit http://cran.r-project.org, or the UCLA mirror at http://cran.stat.ucla.edu

A First Time for Everything

There are two ways to start R.

1. Click on the R icon in the Dock. The Dock is the strip at the bottom of the screen with a bunch of icons.

2. From the Terminal on your local account, or from lab-compute.stat.ucla.edu, type R and hit ENTER.

lab-compute is used for 202A, not now.


A First Time for Everything

The > symbol is the R command prompt. First off, R can be used as a bloated calculator.

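> 5+7
[1] 12

> log(exp(sqrt(sin(pi))))
[1] 1.106619e-08

(sin(pi) is not exactly zero in floating point, so the last digits of this result can differ from machine to machine.)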

It is important to note that arithmetic is elementwise when working with vectors and matrices.

The basic arithmetic operators are +, -, * and /. To get the remainder of dividing a by b, use the modulo operator, %%. Division using / is standard division, not integer division.

Each output is a scalar: a vector of length 1. The [1] means that what is printed is the first element of the returned vector.
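To make the operators above concrete, here is a minimal sketch. The numbers are made up for illustration, and %/% (R's integer-division operator) is not mentioned on the slide; it is included only for contrast with / and %%.

```r
# Remainder, integer quotient and ordinary division of 17 by 5
r <- 17 %% 5     # remainder
q <- 17 %/% 5    # integer quotient (operator not covered on the slide)
d <- 17 / 5      # standard division

# Arithmetic on vectors is applied element by element
v <- c(1, 2, 3) * c(10, 20, 30)

print(c(r, q, d))
print(v)
```

Printing r, q and d together shows the difference between the three kinds of division at a glance.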

Order of Operations

“When in doubt, use parentheses!” Parentheses clarify precedence in expressions.

> 4/5/2
[1] 0.4

> 4/(5/2)
[1] 1.6

Order of Operations

(1) Parentheses ()
(2) Exponents ^
(3) Multiplication * and Division / (equal precedence, evaluated left to right)
(4) Addition + and Subtraction - (equal precedence, evaluated left to right)

PEMDAS: Please Excuse My Dear Aunt Sally
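A few precedence cases tend to surprise newcomers. The sketch below uses made-up expressions to show them; the colon operator (:) is introduced later in these slides and appears here only because it binds tighter than * and -.

```r
a  <- -2^2       # ^ is applied before the unary minus: -(2^2)
b  <- (-2)^2     # parentheses force the minus first
c3 <- 2^3^2      # ^ is evaluated right to left: 2^(3^2)
d  <- 1:3 * 2    # : is applied before *: (1:3) * 2
e  <- 1:(3 * 2)  # parentheses give the sequence 1 through 6

print(list(a = a, b = b, c3 = c3, d = d, e = e))
```

When an expression mixes several operators, adding parentheses costs nothing and removes the guesswork.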

Uh Oh: Unbalanced ()

If you open a term with (, you must close it with ). If you don’t, R won’t know what to do; it will not give an error.

> 2+(5/(2/4)
+

The + sign on the next line means that R is waiting for more input. It usually means you are missing a ), ], or } (we’ll get to the last two later). If you hit ENTER again, another + will appear. Argh! To break out of this, hit ESC followed by ENTER (in a Terminal, use Ctrl-C instead).


Variables and Assignments

In order to do something with a calculation, we need to be able to reference it later. We want to store the output (value) of an expression in a variable. This is called assignment. We assign a value to a variable using <- or =.

> x = sqrt(25)
> x
[1] 5
> y <- 4
> z <- x^2 + y^2
> x^2 + y^2 -> z

Variable names must start with a letter and cannot contain the colon (:), semicolon (;), hyphen (-) or comma (,).

Assignment can also use = instead of <-, but <- is preferred. Assignment can also be done the other way with ->, but it is not recommended.

Objects

A variable is a type of object in R. Basically, everything in R is an object.

When we store an object in R, it is stored in your workspace. The workspace is like a virtual folder that only R knows about.

We can display what is currently in the workspace using the functions objects() or ls(). The two are identical.

> ls()
[1] "x" "y" "z"

The output from ls() is actually a vector of length 3 and can be assigned to a variable.

Removing Objects

To remove a variable from the workspace, use the functions remove() or rm().

If you are a Unix user, you should notice a pattern…

> remove(x)
> ls()
[1] "y" "z"
> rm(z)
> objects()
[1] "y"

Clearing the Workspace

Sometimes we want to remove ALL of the objects in a workspace. This can be a good debugging step.

The remove() and rm() functions take a vector of object names and remove all objects whose names are stored in that vector.

rm(list=ls())

Remember that ls() returns a vector. Here I pass that vector into the function as a parameter. list= tells rm() that I am passing it a vector of names (that it calls list) instead of a single object name. list is the name of the parameter that we are passing to rm. This is not to be confused with a list data type!


Reviewing Previous Commands

1. You can use the UP and DOWN arrows on the keyboard to scroll through your previous commands, one at a time. This works in the GUI, and is supposed to work in the Terminal.

2. You can display the history by typing history()

-OR-

If you are using the GUI version of R (not using the terminal) then you can display your entire command history by clicking on the circled button below.

Reviewing Previous Commands

Before quitting R today, you may want to save your command history as a reference.

savehistory(file="history.log")

If we only include a filename (like above), the file is saved in the current working directory.

R routinely saves your history into the file .Rhistory. This is how R is able to display your history on the screen.

The Working Directory

If we specify a filename without a path, like myfile.txt, R will look in the current working directory for the file. This is true for both importing files and exporting files.

To display the current working directory, type getwd().

> getwd()
[1] "/Users/ryan"

If I write a PDF file, save my history, or try to read data using only the filename, files will be read from or written to the current working directory, /Users/ryan

The Working Directory

I do not want files read from or written to my home directory, so I change it to something else, say /Users/ryan/Desktop/R_bootcamp

To change the current working directory, use setwd().

> setwd("/Users/ryan/Desktop/R_bootcamp")

The new directory must already exist for this to work.

You should use a separate directory for each assignment or project!!!
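The workflow above can be sketched end to end. The folder name R_bootcamp_demo and the file myfile.txt are hypothetical, and the folder is created under tempdir() so the example does not touch your real files.

```r
old <- getwd()                                    # remember where we started

# Hypothetical project folder inside R's temporary directory
proj <- file.path(tempdir(), "R_bootcamp_demo")
dir.create(proj, showWarnings = FALSE)            # setwd() needs an existing directory
setwd(proj)

writeLines("hello", "myfile.txt")                 # no path given: goes to the working directory
files <- dir()                                    # list files in the working directory
print(files)

setwd(old)                                        # restore the original working directory
```

Saving the old directory first and restoring it at the end is a small habit that keeps later commands from reading or writing in an unexpected place.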


The Working Directory

You can override the use of the working directory by passing the full path to the file you want to read/write.

save(x,file="/Users/ryan/Documents/x.Rdata")

This will save the file x.Rdata in the directory /Users/ryan/Documents/, NOT in the current working directory.

The Working Directory

Finally, we can list the files in the current directory by using the command dir().

> dir()
[1] "history.png" "thermal.R"

Unix buffs take note! In R, ls() lists the objects in the workspace; dir() lists the files in the current working directory.

R Sessions

Each time we start R, we start a new R session. When we quit R, the session is destroyed (a geeky term).

When we quit R, we are asked whether or not we want to save the current workspace. You should answer “yes”; however, there is a better way to save your workspace that ensures that it is indeed saved. We will visit this in a bit. If you save your workspace using this default method, it is stored as the file .RData in the directory from which R was started.

When we enter R again, this default workspace will be restored.

Do not quit R right now.

Saving the Workspace

Saving your workspace allows you to store all of your work in one nice little package. You can then take this workspace file with you to another computer and work on it without having to reimport all of your data and run any code again.

save.image(file="myworkspace.Rdata")

IMPORTANT: While R is running, everything is stored in memory. If R crashes, you lose your work. So save your work, and save it often.

We can reload the workspace by using the load function.

load("myworkspace.Rdata")

The variables are restored using the same names they were saved as.


Saving the Workspace

Here is proof that this works…

> save.image(file="myworkspace.Rdata")
> ls()
[1] "x" "y" "z"
> rm(list=ls())
> ls()
character(0)
> load("myworkspace.Rdata")
> ls()
[1] "x" "y" "z"

Saving Objects

We can also save just certain objects. This is good to use when regenerating the object takes a long time or is a pain to do (parameter estimates from a simulation perhaps).

save(x,file="x.Rdata")

The first parameter is the object that you want to save. The second is the name of the file that we want to save the object to.

We can load the object by using the load function.

load("x.Rdata")

The variable is restored using the same name it was saved as.
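A round trip through save() and load() might look like the sketch below. The values in x are made up, and the file is written under tempdir() (an assumption made here so the example leaves no files behind).

```r
x <- c(4.2, 1.7, 3.3)                   # made-up object worth keeping
f <- file.path(tempdir(), "x.Rdata")    # temporary file location

save(x, file = f)                       # write just this one object
rm(x)                                   # remove it from the workspace

restored <- load(f)                     # load() returns the names it restored
print(restored)
print(x)                                # x is back under its original name
```

Note that load() recreates the object under the name it had when saved, so it will silently overwrite any existing object with that name.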

Ending the Fun: Quitting R

To quit R, type q().

You are asked if you want to save your work. If you answer y, save.image is called. You are then taken back to the Unix prompt.

Our Data Exercise


Thermal Mapper

This project took place in the White Mountains. This site was chosen due to its extreme temperature difference between the air and the surface.

A robotic scanning device containing a temperature sensor and a camera was used to examine the ground cover.

Thermal Mapping

Below is a portion of a 3 m transect over a granite fellfield. The robot moves along the transect recording a measurement every 1cm.

A complete pass along the transect took about 10 mins, and the robot ran continuously from 8:30am August 1 to 8:40am August 2, 2005.

Thermal Mapping

This image is located at

http://www.stat.ucla.edu/~rosario/boot/thermal.jpg

You will need it!

Loading the Data

There are several ways to load data into R. For now, we will keep it simple and import all of the data. Later on I will briefly describe the other methods, but we will not import any data using them.

source("http://www.stat.ucla.edu/~rosario/boot/data.R")

This file automatically creates all of the data for you and dumps the objects into your workspace. Use ls() to list these new objects. You should see the following:

> source("http://www.stat.ucla.edu/~rosario/boot/data.R")
> ls()
[1] "groundcover" "positions" "scantimes" "temp"


Introducing our Data

positions: a vector of length 249 (length(positions)). The distance of each spot from a fixed reference point, in cm.

scantimes: a vector of length 148 that contains the end times of each scan.

temp: a 249x148 matrix representing the temperatures collected by the robot at each position (dim(temp)).
ncol(temp) returns the number of columns (one per scan time)
nrow(temp) returns the number of rows (one per location)
(dim, ncol and nrow only work on data.frames and matrices)

groundcover: a data frame containing two variables, one indicating position, and another indicating what was found at that position: plant, sand, rock.

Objects

Everything in your workspace is an object. We can see what kind of object each is by using the class() function.

> class(groundcover)
[1] "data.frame"
> class(positions)
[1] "numeric"
> class(scantimes)
[1] "character"
> class(temp)
[1] "matrix"

Numeric and character are types of vectors.

Vectors

A vector is a collection of values grouped together in one container. In R, we have the following basic value types:

numeric: real numbers
complex: complex numbers*
integer: integers (subset of numeric)
character: individual characters, or strings
logical: TRUE or FALSE
raw: raw bytes*

Vectors can contain the value NA for missing values.

Vectors consist of only one type of data; this type is called the mode of the vector and can be queried with the mode function.
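As a quick check of the value types listed above, the sketch below builds a few small made-up vectors and asks for their modes. It also shows that NA can sit inside a vector without changing its mode.

```r
num <- c(1.5, NA, 3)        # numeric vector with a missing value
chr <- c("a", "b")          # character vector
lgl <- c(TRUE, NA)          # logical vector with a missing value

modes <- c(mode(num), mode(chr), mode(lgl))
print(modes)

int_mode <- mode(1:3)       # integer vectors report mode "numeric"
missing  <- is.na(num)      # TRUE where a value is missing
print(int_mode)
print(missing)
```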

Data Structures

As statisticians, we often work with rectangular data. That is, each row is an observation, and each column is a variable. The data structures in R reflect this practice.

vectors are ordered collections of primitive elements.
matrices are rectangular collections of primitive elements (same type) having 2 dimensions; arrays generalize this to any number of dimensions.
factors are categorical variables.
data.frame is a 2 dimensional data table consisting of columns; the elements within a column are all of the same type, but different columns can have different types (“a dataset” if you will).


Displaying Vectors

When displaying a vector that has more elements than fit on one line of the screen, R will wrap the output to the next line and print a number in brackets [] indicating the index of the first element on that line.

> mode(positions)
[1] "numeric"

Working with Vectors

Many types have functions you can use to ask an object “are you of type…?”

> is.vector(scantimes)
[1] TRUE
> is.vector(positions)
[1] TRUE
> is.vector(temp)
[1] FALSE ##it’s a matrix

This function is an example of one that returns a logical.

Aside: mode vs. class

Let x be a matrix.

class(x) returns an object’s type:

> class(x)
[1] "matrix" #or data.frame, numeric, etc.

mode(x) returns the data type of the elements contained in x.

> mode(x)
[1] "numeric" #or logical, character

Aside: Objects

We can find the length of a vector, which is simply the number of elements it contains. For other object types (particularly matrix and data.frame), length may not return something useful.

> length(positions)
[1] 249

All objects have a length, class and mode. For vectors, the class is just its mode.


Creating Empty Vectors

We can create an empty vector v several ways.

v <- NULL

v <- c() creates a concatenation of 0 values.

v <- vector(length=100,mode="numeric") creates a numeric vector of 100 zeros, ready to be filled in.

Usually we can create a vector from values.

Creating Vectors

We can create a vector by concatenating values using the c() function.

v <- c(1,2,3) #numeric

v <- c("celery","lettuce","pepper") #character

v <- c(TRUE,TRUE,FALSE) #logical

Creating Vectors of Integers

Often vectors of integers are used to create indices into other vectors, or matrices. We can create integer vectors as follows.

> v <- 1:100 #a vector containing integers 1-100.
> class(v)
[1] "integer"

> v <- rep(1,5) #repeat 1 five times.
> v
[1] 1 1 1 1 1

> x <- 4:6 #integers 4 thru 6
> y = rep(x,1:3) #repeat elements of x: 1, 2 and 3 times.
> y
[1] 4 5 5 6 6 6

Vectors of Mode numeric

We can also construct vectors of mode numeric using the seq (for sequence) function.

> x = seq(0,3,len=100)
> #sequence of 100 evenly spaced
> #values from 0 to 3

> x = seq(0,10,by=2)
> #sequence of values from 0 to 10,
> #difference of 2 between values

> x
[1] 0 2 4 6 8 10


More Fun with Sequences

We can build more advanced sequences that may be helpful. (Hongquan Xu likes to use these.)

> rep(9:11,3) #repeat vector (9,10,11) three times.
[1] 9 10 11 9 10 11 9 10 11

> rep(9:11,c(3,3,3))
> #Repeat each of the three elements three times.
[1] 9 9 9 10 10 10 11 11 11
> #Can also be written…

> rep(9:11,rep(3,3))

Do you see the connection between methods 2 and 3?
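The connection between the second and third calls is that rep(3,3) is simply the vector c(3,3,3). The sketch below checks this and adds a fourth spelling using the each argument of rep, which the slide does not mention.

```r
a  <- rep(9:11, 3)            # whole vector repeated three times
b  <- rep(9:11, c(3, 3, 3))   # each element repeated three times
c2 <- rep(9:11, rep(3, 3))    # same thing: rep(3, 3) is c(3, 3, 3)
d  <- rep(9:11, each = 3)     # 'each' argument (not on the slide) does the same

print(a)
print(b)
print(identical(b, c2) && identical(b, d))
```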

Naming Elements of a Vector

We can name each element of a vector while creating it, or after creating it.

Naming Elements during Creation
> x <- c("A" = 4.0, "B" = 3.0, "C" = 2.0)

Naming Elements after Creation using the names Function
> x = c(4.0,3.0,2.0)
> names(x) <- c("A","B","C")

> names(x)
[1] "A" "B" "C"

Printing Vectors with Named Elements

R prints vectors containing named elements differently. Instead of printing the index (i.e. [1]), it prints the name of each element above the element.

> x <- 1:26
> names(x) <- letters
> x
 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

(letters is a global vector containing the alphabet)

Operations on Vectors

Many operations in R work on entire vectors since the vector is the basic object in R.

Suppose each variable below is a vector of length 100.

> x <- runif(100)
> z <- sqrt(x) #take sqrt of each element of x
> y <- x + 2*z + rep(c(1,2),50)


Recycle and Reuse

R will let you add two vectors of differing lengths. I don’t recommend this, but it works…if used correctly. R will recycle the elements of the shorter vector to match the length of the longer vector.

> c(1,2,3) + 2 #no warning (3 is a multiple of 1)
[1] 3 4 5
> c(100,200,300,400) + c(1,2,3)
[1] 101 202 303 401
Warning message:
longer object length is not a multiple of shorter object length in: c(100, 200, 300, 400) + c(1, 2, 3)
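To see exactly what recycling does, the sketch below compares an implicit recycled sum with an explicit one. The vectors are made up; the lengths 4 and 2 were chosen so that no warning is raised.

```r
long  <- c(10, 20, 30, 40)
short <- c(1, 2)

implicit <- long + short   # short is reused as 1 2 1 2

# The same computation with the recycling written out by hand
explicit <- long + rep(short, length.out = length(long))

print(implicit)
print(identical(implicit, explicit))
```

Writing the rep() call yourself makes the intent visible to the next reader, which is the safer habit when the lengths are not obviously compatible.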

Coercion

Vectors can only contain one element type. If we mix types, R will try to convert all of the elements to a common format. Doing this intentionally is not recommended.

> x <- c(1:3,"a") #numbers converted to characters
> x
[1] "1" "2" "3" "a"

#Can be useful, not recommended.
> c(TRUE,TRUE,FALSE)+3
[1] 4 4 3

Do-It-Yourself Type Casting

Rather than rely on coercion, we should type cast objects to a common type. Type casting just means converting a variable of one type to another (comparable) type.

> as.integer(2/4) #0.5 is truncated to an integer
[1] 0

Let a correct answer on a multiple choice test be denoted TRUE, and let x be the vector of scored responses (logical).

> score <- sum(as.numeric(x)) #coercion is ok here.
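Here is a small sketch of explicit casting with the as.* family. The response vector x is hypothetical, and suppressWarnings() is used only to keep the failed conversion at the end quiet.

```r
x <- c(TRUE, FALSE, TRUE, TRUE)               # hypothetical scored responses
score <- sum(as.numeric(x))                   # TRUE counts as 1, FALSE as 0

n <- as.integer("42")                         # character to integer
s <- as.character(3.5)                        # numeric to character
bad <- suppressWarnings(as.numeric("abc"))    # cannot be converted: result is NA

print(score)
print(list(n = n, s = s, bad = bad))
```

A conversion that cannot succeed yields NA with a warning rather than an error, so it is worth checking for NA after casting data you did not create yourself.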

Vector Functions

R lets us perform important data management operations on vectors, like sorting and finding unique values in a vector.

> sort(x)
[1] -0.916999756 -0.497566155 0.003459213 0.191734263

> y <- c(1,1,2,2,2)
> unique(y)
[1] 1 2

> duplicated(y)
[1] FALSE TRUE FALSE TRUE TRUE


Concatenating Vectors

We know we can concatenate values using c(…) but we can also concatenate vectors with c(…)

> x <- c(1,2,3)
> y <- c(4,5,6)
> z <- c(x,y) #1 2 3 4 5 6

With coercion…
> x <- c(TRUE,TRUE,FALSE)
> y <- c(1,2,3)
> z <- c(x,y) # 1 1 0 1 2 3

Add the value 4 to y…
> c(y,4) # 1 2 3 4

Subsetting

It is unavoidable; we will always want to pick out certain values from a vector depending on some conditions. How to extract from a vector:

scantimes[6] # the sixth element
scantimes[6:10] # elements 6 thru 10

scantimes[c(1:5,8,11:15)]
#elements 1 thru 5, element 8, 11 thru 15.

Called Indexing by Position

Subsetting

We can also exclude portions of a vector and keep everything else.

scantimes[-6] # everything but the sixth element
scantimes[-c(6:10)] # everything but elements 6 thru 10

We can extract using 1:6, but to exclude, we use -c(1:6).

We cannot use both extraction and exclusion simultaneously.

letters[c(-10,11,43)] #error

Called Exclusion by Position

The Five Rules of Subsetting

We use the brackets [] to subset. R has 5 rules for selecting certain elements from a vector.

1. Indexing by position Done!
2. Indexing by exclusion Done!
3. Indexing by name
4. Indexing with a logical mask
5. Empty subsetting


Subsetting: Indexing by Name

Recall that we gave names to the elements of a vector.

> x <- 1:3
> names(x) <- c("a","b","c")
> x["b"]
b
2

Subsetting: Indexing with a Logical Mask

Using logical masks can make data extraction much faster. Life is difficult without them!!

> x <- 6:10
> x >= 7
[1] FALSE TRUE TRUE TRUE TRUE

Thus,
> x[x >= 7] #"x such that x >= 7"
[1] 7 8 9 10

It’s So Logical: Logical Operators

We use logical operators in subsetting as well as with loops and control statements.

Logical operators include

< > <= >= == !=

It’s So Logical: Logical Operators

As humans, we are picky. We often want to select elements that meet more than one condition. Logical operators are the glue that makes this possible.

& AND must meet both conditions
> x <- 10
> x < 5 & x > 7 #FALSE
> x > 1 & x > 2 #TRUE

| OR must meet either condition
> x<5 | x==10 #TRUE
> x<5 | x>20 #FALSE
> x>5 | x>9 #TRUE


It’s So Logical: Logical Operators

! NOT must not meet the condition
> x <- 10
> !(x < 5) & !(x > 7) #FALSE
> !(x < 5) & (x > 7) #TRUE
> !((x < 5) | (x > 7)) #FALSE

> x <- 3.3
> !is.na(x) #testing for complete cases.

Also, all(v < 10) returns TRUE if all elements of v are < 10. any(v < 10) returns TRUE if any elements of v are < 10.
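Because these operators work elementwise, they return one TRUE or FALSE per element when applied to a vector. The sketch below uses a made-up vector v to show &, |, ! together with all() and any().

```r
v <- c(3, 8, 12)

between <- v > 5 & v < 10    # both conditions must hold, element by element
outside <- v < 5 | v > 10    # either condition may hold
negated <- !between          # flips every element

all_small <- all(v < 10)     # are all elements below 10?
any_small <- any(v < 10)     # is at least one element below 10?

print(between)
print(outside)
print(c(all_small, any_small))
```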

Are you a good which or a bad which?

which is your best friend! Pass a logical expression to which and it returns the indices of the vector where that condition is true.

> data <- c(100, 100, NA, 200, 254, NA, 14)
#data with missing values. To remove…
> complete.cases <- which(!is.na(data))
#Use result to subset the original data.
> complete.data <- data[complete.cases]

Use it!!!
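Using the same data as the slide, the sketch below shows what which returns and confirms that subsetting by those indices gives the same result as subsetting by the logical mask directly. The variable names idx, by_index and by_mask are chosen here for illustration.

```r
data <- c(100, 100, NA, 200, 254, NA, 14)

idx <- which(!is.na(data))     # positions of the non-missing values
by_index <- data[idx]          # subset using the positions
by_mask  <- data[!is.na(data)] # subset using the logical mask itself

print(idx)
print(by_index)
print(identical(by_index, by_mask))
```

The positions returned by which are handy when you need to know where the matches are, not just what they are.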

A Clever Irony: Empty Subsetting

> x <- 1:5
> x[]
[1] 1 2 3 4 5

Seems pointless, right? Not so fast.

> x <- 1:10
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x[] <- 5
> x
[1] 5 5 5 5 5 5 5 5 5 5

Bottom line: empty subsetting is used for blanket assignments.
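All five subsetting rules can be tried side by side on one small named vector. The vector x and its values below are made up for this sketch.

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

by_position  <- x[c(1, 3)]        # rule 1: indexing by position
by_exclusion <- x[-2]             # rule 2: indexing by exclusion
by_name      <- x[c("a", "d")]    # rule 3: indexing by name
by_mask      <- x[x > 15]         # rule 4: indexing with a logical mask

y <- x
y[] <- 0                          # rule 5: empty subsetting, a blanket assignment

print(by_position)
print(by_mask)
print(y)                          # every value replaced, names and length kept
```

Compare y[] <- 0 with y <- 0: the first keeps the vector's length and names, while the second replaces the whole object with a single number.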

Subsetting: Common Pitfall

> x["a","b"]
Error in x["a", "b"] : incorrect number of dimensions

x is a vector! Only 1 dimension. I think you meant:
> x[c("a","b")]

> x[-"a"]
Error in -"a" : Invalid argument to unary operator

Can’t do this with names.


Subsetting for Assignment

An analyst notes that the data from a flight data recorder has some inconsistencies. In particular, some inflight readings for altitude or velocity are 0, or even negative. When we plot these data, these readings show as spikes that obfuscate any trend in the data. Suppose we have data only for the time that the plane is flying (we know these parameters are > 0). Let’s fix this…

> fdr.velocity <- c(550,600,0,600,600,650,0,-5,600,550,-5)
> fdr.velocity[fdr.velocity <= 0] <- NA
> fdr.velocity
[1] 550 600 NA 600 600 650 NA NA 600 550 NA

Matrices

Matrices consist of two dimensions and can be much more powerful than just vectors.

For small matrices, we can display the values contained in the matrix just as we do for a vector. But…

For large matrices, all we will see is a bunch of numbers flying by on the screen. We can visualize a large matrix using the image function.

Visualizing a Matrix

image(temp)

This doesn’t tell us much…

Visualizing a Matrix

We can do better…much better.

Try this… Don’t worry, we will figure this out together in a moment.

> image(positions,1:148,temp,col=topo.colors(200),
+ xlab="position", ylab="time",
+ main = "Temperature vs. Time and Location")

First off, we can split a long command over several lines by simply hitting ENTER and continuing the command.

Now what does this creature do? Oh, the suspense is terrible!


Helping You Help Yourself!

Some people love R’s help system (myself included), and others hate it.

There are two ways to access help on a specific command or function. Let’s look up the documentation for the image function.

Type ?image at the command prompt. We can also use help("image") to access the same information. If we want to search the help system, use help.search("findme").

Let’s look away from these slides and look at the documentation together.

Things to consider: the ordering of the parameters, using an arbitrary order of parameters, default parameters, optional parameters.

Helping You Help Yourself!

Let’s dissect…

> image(positions,1:148,temp,col=topo.colors(200),
+ xlab="position", ylab="time",
+ main = "Temperature vs. Time and Location")

The value of the first parameter (x) is positions, so image will plot the positions values on the x axis.

The value of the second parameter (y) is 1:148, which is just a vector containing the numbers 1 thru 148 inclusive, and tells image that there is one temperature (z) value to plot for each of the 148 times given in scantimes.

The value of the third parameter (z) is temp because we want to color the image using temperature.


The value of the fourth parameter (col) is the value returned by the function topo.colors(200). We must prepend col= because R expected this parameter to be the 7th parameter, not the 4th, so we need to tell R which parameter this is. topo.colors(200) is a function call, and it returns a vector containing 200 color values from the topography color palette.


The 5th parameter (xlab) instructs image to add a label for the x axis.

The 6th parameter (ylab) does the same for the y axis.

The 7th parameter (main) instructs image to give a title to the entire plot.

These last three parameters you will use over and over again, and they make your plots look nice.
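The positional versus named parameter rules discussed above apply to every R function, not just image. The sketch below illustrates them with matrix() instead, chosen here because it needs no data or graphics device; formals() is used only to list the parameter names in their declared order.

```r
# The declared order of matrix()'s parameters
arg_names <- names(formals(matrix))
print(arg_names)

m1 <- matrix(1:6, 2, 3)                 # purely positional: data, nrow, ncol
m2 <- matrix(1:6, ncol = 3, nrow = 2)   # named parameters can come in any order
m3 <- matrix(ncol = 3, data = 1:6)      # nrow left out: its default is worked out by R

print(identical(m1, m2) && identical(m1, m3))
```

Naming a parameter is required only when you skip over the ones that come before it, but naming them anyway often makes a long call easier to read.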


Oooh! Ahhh!

We notice something interesting: there seems to be a warmer temperature between positions 3750 and 4100. (Green=cooler, blue/white=hot).

Exercise 1

What indices do these positions correspond to in the positions vector? What do you notice? Hint: look at the map of the transect at http://www.stat.ucla.edu/~rosario/boot/thermal.jpg.

Finally, mark this region in your plot with two vertical lines. Look up the function abline in help.

Exercise 1: Answer…sort of

There are many ways to do this.

What indices do these positions correspond to in the positions vector?

which(positions == 3750) #103
which(positions == 4100) #138

What do you notice?

There is a huge boulder between these two points.

Finally, mark this region in your plot with two vertical lines. Look up the function abline in help.

abline(v=c(3750,4100),lty=2,lwd=0.7)


Lunch Break

Matrices

Like vectors, all elements of a matrix must be of the same type. When we print a matrix, R prints column indices and row indices.

> A
     [,1] [,2]
[1,]    1    0
[2,]    0    1

Creating Matrices from Vectors

We can glue together vectors rowwise or columnwise to create a matrix.

> x <- c(1,0)
> y <- c(2,2)
> I <- cbind(x,y)
> I
     x y
[1,] 1 2
[2,] 0 2

> I <- rbind(x,y)
> I
  [,1] [,2]
x    1    0
y    2    2

Creating Matrices from Scratch

We can also use the matrix constructor to create a matrix.

#Create a matrix of random uniform values: 5 rows, 2 columns
> A <- matrix(runif(10),5,2)
> A
          [,1]      [,2]
[1,] 0.3778161 0.1205188
[2,] 0.2967549 0.6951926
[3,] 0.4215449 0.2569569
[4,] 0.9628448 0.4448082
[5,] 0.1582352 0.1582504


Creating Matrices from Scratch

We don’t have to specify both the number of rows and columns. We can leave one out, but remember the order that R is expecting the parameters to be in…

> A <- matrix(rnorm(10),2)
#create a matrix with 2 rows, R figures out it has 5 columns

> B <- matrix(rnorm(10),ncol=2)
#create a matrix with 2 cols, R figures out it has 5 rows

R expects to see the parameter nrow first, so if we leave it out, we must specify ncol=

Creating Matrices from Scratch

Suppose we want to hardcode the matrix with rows (1, 1) and (2, 4).

We can create a matrix from a vector, but it is important to note that R will fill the matrix column by column unless we specify byrow=TRUE.

> matrix(c(1,1,2,4),2)
     [,1] [,2]
[1,]    1    2
[2,]    1    4
Wrong!

> matrix(c(1,1,2,4),2,byrow=TRUE)
     [,1] [,2]
[1,]    1    1
[2,]    2    4
Corr-ect!
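The two fill orders are transposes of one another, which gives a quick way to check which one you have. The sketch below uses the same four values as the slide; the names by_col and by_row are chosen here for illustration.

```r
v <- c(1, 1, 2, 4)

by_col <- matrix(v, nrow = 2)                 # default: fill column by column
by_row <- matrix(v, nrow = 2, byrow = TRUE)   # fill row by row

print(by_col)
print(by_row)
print(identical(by_row, t(by_col)))           # one is the transpose of the other
```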

R Cool Trick 1,063,199

Oh yeah, and R is cool with abbreviating parameter names as long as it can resolve what you mean.

> A <- matrix(rnorm(10),nr=2,nc=5)

This would not work if there were other parameter names starting with nr or nc.

Reshaping Matrices

We can change the shape of a matrix, or coerce a vector into a matrix.

> x <- 1:10
> dim(x) <- c(5,2)
> x
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10


Naming Rows and Columns of Matrices

We assign names to the rows and columns of a matrix as follows. Note the difference with vectors.

> x <- matrix(letters[1:10],5,2)
> rownames(x) <-
+ c("roberts","specter","kennedy","schumer","hatch")

> colnames(x) <- c("var1","var2")
> x
        var1 var2
roberts "a"  "f"
specter "b"  "g"
kennedy "c"  "h"
schumer "d"  "i"
hatch   "e"  "j"

Subsetting Matrices

Subsetting with matrices is similar to subsetting with vectors except now we have two dimensions (row and column) instead of one.

The temp matrix has locations as rows and times as columns.

> temp[1,] #First row (location) of temp
> temp[,1] #First column (time) of temp.

#upper left corner of temp: first 5 rows, first 5 columns
> temp[1:5,1:5]

Subsetting Matrices

As a (very basic) example of subsetting a matrix, let’s create a histogram of the temperature of the first location over the entire period. Recall that the first location is the first row of the matrix; we grab the whole row by subsetting using the first index.

> hist(temp[1,])

The help file contains more information about how to specify breaks and the number of bins. Check it out!

I added a name to the axes and the plot itself, but I exclude this code to save room.

Operations

As with vectors, R can operate on entire matrices at once.

There are some special functions that are only valid with matrices.

> x%*%y #matrix multiplication
> W = t(x) #Transpose
> Xinv <- solve(x) #Invert
> D <- diag(x) #Extract diag. elements
> D <- diag(1,10,5) #Create 10x5 matrix, 1s on diagonal.
> eigen(x) #eigenvalue decomposition
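A short sketch of these functions on a small made-up 2x2 matrix A. It also contrasts %*% with *, since * multiplies element by element rather than performing matrix multiplication.

```r
A <- matrix(c(2, 0, 1, 3), 2)   # columns are (2, 0) and (1, 3)

Ainv <- solve(A)                # matrix inverse
I2   <- A %*% Ainv              # matrix product: should be the 2x2 identity
At   <- t(A)                    # transpose
dA   <- diag(A)                 # diagonal elements

elementwise <- A * A            # squares each entry
matprod     <- A %*% A          # true matrix product

print(I2)
print(elementwise)
print(matprod)
```

all.equal() rather than identical() is the safer comparison for results like I2, since matrix inversion involves floating-point arithmetic.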


Matrices: Apply

Finally, an operator that makes life a lot easier...

> x = matrix(1:50,10)
> dim(x)
[1] 10 5
> apply(x,1,sum)
[1] 105 110 115 120 125 130 135 140 145 150
> apply(x,2,mean)
[1] 5.5 15.5 25.5 35.5 45.5

Back to our Exercise

Let’s now put this to use; the point of the thermal mapperexperiment was to characterize differences in surfacetemperature as a function of groundcover.

Suppose we are told that row 178 of the temperature matrixcorresponds to a rock, and 218 to a plant.

Plot Two Locations over Time

> plot(1:148,temp[178,])
> points(1:148,temp[218,],col=3)

> positions[c(178,218)]
[1] 4500 4900

First we plot the integers 1:148 (corresponding to each scan time) vs. the temperature of location 4500 throughout the day.

Then we plot the temperature of location 4900 throughout the day on top of the original plot, and color the points green (col = 3).

Note: we use 1:148 instead of the scantimes vector because scantimes is of mode character.

A One Track Mind

R can only render one plot at a time.

To add the second plot on top of the first, use the points function. (Similarly, there is also a lines function.)


A One Track Mind

We could also display the distributions for both groundcover types using side-by-side boxplots.

> boxplot(groundcover$type,temp)

We could also use formula notation (we will get to this later) to make the same boxplot. If we wanted a boxplot of just one variable, then just pass boxplot one variable.


Another Look

Notice that when we extract a single column or row from a matrix, R converts the result to a vector.

In our plot command, we have used a simple index (1:148) to represent time; recall that it takes the robot roughly 10 minutes for a pass.

Technically, if we just call plot(temp[1,]) we get the same figure; in general, if x is a vector, in plot(x) R assumes that you want to treat the data like a time series and uses 1:length(x) for the x-axis.

Another Look

With subsetting and the plot function, we can start to explore the basic hypothesis that different ground covers behave differently over the course of the day.

We note that the points in black correspond to rock while the cyan (col = 5) dots are classified as plant.

What can we say about rock and plants in general?


Another Look

> positions[177:185]
[1] 4490 4500 4510 4520 4530 4540 4550 4560
[9] 4570
> x <- apply(temp[177:185,],2,mean)
> plot(x,type="l")
> positions[208:222]
 [1] 4800 4810 4820 4830 4840 4850 4860
 [8] 4870 4880 4890 4900 4910 4920 4930
[15] 4940
> y <- apply(temp[208:222,],2,mean)
> lines(y,col=3)

Another Look

The function apply allows us to operate on rows or columns of a matrix.

In this case, we have taken the data in each column and formed an average with the built-in function mean.

The “2” in this call tells R to operate on the data in a column; if we had used “1”, it would apply the mean function to each row.

Another Look

In general, apply is a kind of vectorized operation, except in this case it works on entire rows or columns of a matrix (actually, it works on arrays, but we’re not there yet).

There are various flavors of this function that work on other R data structures (lapply, sapply, tapply).

In some cases, these represent a speedup over looping; in others, they are just cleaner from a coding perspective.
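To make the other flavors concrete, here is a small sketch on made-up numbers (the vectors temps and cover are assumptions for illustration, not the thermal mapper data). tapply summarizes a vector within groups defined by a factor, lapply returns a list, and sapply tries to simplify the result to a vector.

```r
# Made-up temperatures for six locations (not the bootcamp data)
temps <- c(21.5, 30.2, 19.8, 31.0, 25.4, 20.1)
cover <- factor(c("plant", "rock", "plant", "rock", "sand", "plant"))

# tapply: apply a function to groups defined by a factor
group_means <- tapply(temps, cover, mean)

# lapply returns a list; sapply simplifies the result to a vector
by_cover   <- split(temps, cover)
len_list   <- lapply(by_cover, length)
len_vector <- sapply(by_cover, length)

print(group_means)
print(len_vector)
```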

Exercise 2

How might one classify sand and rock? What features of these curves might we want to extract?


One Last Data Structure: Data Frames

A data frame is in many ways like a matrix, but the columns can be of different types.

Data frames have 2 dimensions: the number of observations, and the number of variables.

In addition to matrix subsetting, we can index a data frame’s columns by name using the $ operator.

One Last Data Structure: Data Frames

> groundcover[1:3,]
  position type
7     2790 sand
8     2800 sand
9     2810 sand

> groundcover$type[1:3]
[1] sand sand sand
Levels: plant rock sand
> groundcover$position[groundcover$type=="rock"]
[1] 4490 4500 4510 4520 4530 4540 4550 4560 4570
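If you do not have groundcover loaded, the same ideas can be tried on a toy data frame. This is only a sketch: the data frame gc and its values are made up to mimic the shape of groundcover.

```r
# Toy data frame mimicking the shape of groundcover (values are made up)
gc <- data.frame(
  position = c(2790, 2800, 4490, 4500, 4900),
  type     = factor(c("sand", "sand", "rock", "rock", "plant"))
)

dim(gc)                                      # observations x variables
rock_pos  <- gc$position[gc$type == "rock"]  # column by name, then logical subset
first_two <- gc[1:2, ]                       # matrix-style row subsetting
sapply(gc, class)                            # each column keeps its own type

print(rock_pos)
```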

Other Data Import Methods

The function read.csv is a wrapper for read.table; that is, it provides a set of good default values that anticipate what a typical comma-separated file looks like.

read.table, however, is a wrapper for an even lower-level function called scan; this function is better at reading very large data.

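R attempts to coerce each column into logical, integer, numeric and then complex; if that all fails, it will store the data as a factor (this is the default before R 4.0.0; newer versions store it as a character vector unless stringsAsFactors=TRUE).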
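The coercion is easy to see on a tiny inline example. This is a sketch under assumptions: the text in txt is made up, and stringsAsFactors is set explicitly so the result does not depend on the R version.

```r
# Inline text stands in for a file; the contents are made up
txt <- "id,temp,flag,cover
1,20.5,TRUE,rock
2,18.0,FALSE,plant
3,25.1,TRUE,sand"

# stringsAsFactors is set explicitly because its default changed in R 4.0.0
d <- read.csv(text = txt, stringsAsFactors = TRUE)
col_classes <- sapply(d, class)
print(col_classes)
```

Each column ends up with the first type in the chain that fits all of its values.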

Other Data Import Methods

the.url <- "http://www.stat.ucla.edu/~rosario/boot/lion.txt"
data <- read.table(the.url,header=TRUE,sep="",skip=12)

header=TRUE tells R that the first line read (not skipped) by R contains the field names.

sep="" defines the separator used to separate the fields; the empty string means any whitespace. Common separators (delimiters) are "", ",", "\t", "|".

skip=12 tells R to skip the first 12 lines of the file (they are not part of the dataset).

Now print out the dataset to the screen. Cool, eh?


Exporting Data

Now let’s export this data in a more common format: comma-separated values (CSV).

> write.csv(data,file="lions.csv")

We now have a clean CSV file that we can use instead. write.table is similar.
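The import and export steps can be tried end to end without the course file. In this sketch a temporary file with made-up contents stands in for lion.txt, so the file name, column names and values are all assumptions.

```r
# A temporary file stands in for lion.txt; its contents are made up
path <- tempfile(fileext = ".txt")
writeLines(c("Notes about the data",
             "More notes",
             "name weight",
             "simba 190",
             "nala 126"), path)

# Skip the two note lines; sep = "" splits on any whitespace
lions <- read.table(path, header = TRUE, sep = "", skip = 2)

# row.names = FALSE keeps R's row labels out of the CSV
csv_path <- tempfile(fileext = ".csv")
write.csv(lions, file = csv_path, row.names = FALSE)
back <- read.csv(csv_path)

print(back)
```

By default write.csv also writes the row names as an unnamed first column, which is why row.names = FALSE is often convenient when the file will be read back in.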

A Bad which

which does not behave as expected on matrices. It treats a matrix like a vector and returns a single index that counts down the columns of the matrix.

> x <- matrix(c(2,4,6,8,10,12,14,16),2)
> which(x == 6)
[1] 3

This is kind of useless on its own, but the row and column can be recovered from it.

> matrix.row <- (which(x==6) - 1) %% nrow(x) + 1  #1
> matrix.col <- (which(x==6) - 1) %/% nrow(x) + 1 #2
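Base R can also do this conversion for us. The sketch below uses the same matrix; the names rc and rc2 are just illustrative.

```r
x <- matrix(c(2, 4, 6, 8, 10, 12, 14, 16), 2)

# arr.ind = TRUE asks which() for (row, col) pairs instead of a linear index
rc <- which(x == 6, arr.ind = TRUE)

# arrayInd() converts an existing linear index to (row, col)
rc2 <- arrayInd(which(x == 6), dim(x))

print(rc)
```

Both approaches return one row per match, so they also work when the value occurs more than once.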

Linear Model

Now we fit a linear model using a subset of the temperature data.

> y <- temp[9,1:50] #Position index 9, times 1 through 50
> x <- 1:50         #scantimes is weird
> X <- cbind(x,x^2)
> fit <- lm(y~X)

Linear Model: WTF?

lm(y ~ X)

The ~ means we are specifying a formula defining a relationship. Left of the ~ is the dependent variable. Right of the ~ are the independent variables. Predictors can be added (multiple regression) or multiplied (interactions). We can also subtract off the intercept (~…-1) if we really want to.

Since X contains a linear term and a squared term, we are fitting a parabola to the data via least squares; the model is still linear in the coefficients.
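The same parabola can be written directly in formula notation. This is a sketch on simulated data (the coefficients 20, 1.5 and -0.03 and the noise are assumptions, not the thermal mapper readings); note that it reuses the names x, y and X.

```r
set.seed(1)                      # simulated data, not the thermal mapper readings
x <- 1:50
y <- 20 + 1.5 * x - 0.03 * x^2 + rnorm(50)

X <- cbind(x, x^2)
fit_matrix  <- lm(y ~ X)              # predictors supplied as a matrix
fit_formula <- lm(y ~ x + I(x^2))     # I() protects ^ from formula syntax
fit_noint   <- lm(y ~ x + I(x^2) - 1) # same terms, intercept removed

print(coef(fit_formula))
```

The matrix and formula versions give the same estimates; only the coefficient labels differ.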


Linear Model: WTF?

> fit <- lm(y~X)
> fit

Call:
lm(formula = y ~ X)

Coefficients:
(Intercept)           Xx            X
   20.54201      1.55558     -0.03211

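cbind named the first column of X after the variable x, hence Xx above; the second column (x^2) has no name, so its coefficient is labeled just X.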

Linear Model as an Object

What is fit? fit is an object of class lm that contains many components.

> attributes(fit)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

$class
[1] "lm"
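The components can be pulled out with $ or with accessor functions. Here is a minimal sketch on simulated data (the model and numbers are assumptions for illustration; it reuses the names x, y and fit).

```r
set.seed(2)                          # simulated data for illustration
x <- 1:20
y <- 3 + 2 * x + rnorm(20)
fit <- lm(y ~ x)

class(fit)                 # "lm"
names(fit)                 # component names, as in attributes(fit)$names

b   <- coef(fit)           # same values as fit$coefficients
res <- residuals(fit)      # same values as fit$residuals
fv  <- fitted(fit)         # same values as fit$fitted.values

print(b)
```

Fitted values plus residuals reproduce the original response, which is a handy check.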

We can also get more information about the model…

Linear Model as an Object

> summary(fit)

Call:
lm(formula = y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max
-6.7655 -1.1819  0.5192  1.7513  3.1578

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.542010   1.039704   19.76   <2e-16 ***
Xx           1.555576   0.094047   16.54   <2e-16 ***
X           -0.032111   0.001788  -17.96   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.353 on 47 degrees of freedom
Multiple R-Squared: 0.8771,     Adjusted R-squared: 0.8718
F-statistic: 167.6 on 2 and 47 DF,  p-value: < 2.2e-16

The summary Function

We can also use the summary function on objects other than linear models. We can also use it as an exploratory analysis step to examine the distributions of variables.

> summary(temp[,1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.00   15.20   19.10   18.61   21.80   34.10

Here we receive a numerical summary of the temperatures at the very beginning of the study, over all locations.

The functions that produce these individual statistics are on the reference card, and are trivial to use.


Linear Model as an Object

Using this model, we can predict the response given the predictors.

> plot(x,y)

> lines(x,predict(fit))

So, the fit is pretty good!
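predict can also be evaluated at new predictor values. This sketch assumes the model is fit with a formula and a data frame (which makes newdata straightforward) and uses simulated data; the names d, new_x and pred are illustrative, and fit is reused.

```r
set.seed(3)                               # simulated data for illustration
d <- data.frame(x = 1:50)
d$y <- 20 + 1.5 * d$x - 0.03 * d$x^2 + rnorm(50)

fit <- lm(y ~ x + I(x^2), data = d)

# newdata must be a data frame whose column names match the predictors
new_x <- data.frame(x = c(10.5, 60))
pred  <- predict(fit, newdata = new_x)

print(pred)
```

With no newdata argument, predict(fit) simply returns the fitted values at the original x.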

Linear Model as an Object

Using the residuals and fitted.values components of fit, we can plot the residuals vs. fitted values. Using the lowess function, we can overlay a red curve indicating any general trend of the points, but don’t read too much into this curve.

> plot(fit$fitted,fit$residuals,xlab="fitted values",
+ ylab="residuals", main="Residuals vs. Fitted Plot")
> lines(lowess(fit$fitted,fit$residuals),col="red")

Linear Model as an Object

We can also plot the normal quantile plot of the residuals to verify the assumption of normality in the residuals.

We can do this using the qqnorm function. The qqline function overlays the line usually associated with this plot.

> qqnorm(fit$residuals)
> qqline(fit$residuals,lty=2,col="grey")

Tangent: Saving Graph for LaTeX

We can save this graph for inclusion in a LaTeX document in one of two ways. We can use File > Save and save the file as a PDF (Mac), or we can write a PDF from the command line.

> pdf(file="myplot.pdf")      #create a PDF
> plot(x,y)                   #proceed with plot
> lines(x,predict(fit))
> points(x,y,col=3)
> lines(x,predict(fit),col=3)
> dev.off()                   #turn off plotting to file.
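The same pattern works in any script. This sketch writes to a temporary directory (the path out and the plotted values are assumptions for illustration) and sets the page size explicitly, which helps when sizing a figure for a LaTeX document.

```r
out <- file.path(tempdir(), "myplot.pdf")   # hypothetical output location

pdf(file = out, width = 6, height = 4)      # size in inches
plot(1:10, (1:10)^2, xlab = "x", ylab = "x squared")
dev.off()                                   # the file is finalized when the device is closed

print(file.exists(out))
```

Forgetting dev.off() is a common reason for an empty or unreadable PDF.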

(Brief Demonstration)


Multiple Plots

We can plot two plots in the same window.

> par(mfrow=c(2,1)) #two plots stacked (2 rows, 1 column)
> plot(…)
> plot(…)

> par(mfrow=c(1,2)) #two plots side by side (1 row, 2 columns)
> plot(…); plot(…) #multiple commands, same line, separate with ;

To open a new plot window

> quartz() #Mac
> x11() #Unix/Linux; on Windows, x11() is an alias for windows()

Factors

Really quickly, a factor represents a categorical variable. We can see the different values of a factor using levels().

> levels(groundcover$type)
[1] "plant" "rock" "sand"

We will revisit this in 202A.
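Until then, here is a tiny sketch of what a factor stores, using a made-up vector of groundcover labels (the vector type is an assumption, not the course data).

```r
type <- factor(c("sand", "rock", "plant", "rock", "sand", "sand"))

levels(type)                 # distinct categories, sorted alphabetically by default
codes  <- as.integer(type)   # integer codes that index into levels(type)
counts <- table(type)        # number of observations per level

print(counts)
```

Internally the factor is a vector of integer codes plus the levels they refer to.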

Storing Commands in a File

Commands can be stored in a text file called an R script. We can run the entire R script (say, "foo.R") by running the command

> source("foo.R")

The commands in the file will be run sequentially,and quietly, unless there is an error.

It can also be useful to copy/paste commands fromthis file onto the R command line.
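A quick way to try this without creating foo.R by hand is to write a two-line script to a temporary file. This is only a sketch; the script contents and the names script, a and b are made up.

```r
script <- tempfile(fileext = ".R")          # stands in for "foo.R"
writeLines(c("a <- 2", "b <- a * 21"), script)

source(script)                # runs quietly; a and b now exist in the workspace
print(b)

source(script, echo = TRUE)   # echo = TRUE prints each command as it runs
```

The echo = TRUE form is useful when you want the output to look like an interactive session.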

Outputting Tables as LaTeX

Using the verbatim environment in LaTeX works fine for output, but for tables we can do better. Using the xtable package, we can output the LaTeX that can be used to represent the table.

> library(xtable)
Error in library(xtable) : there is no package called 'xtable'

So we need to install it…

> install.packages("xtable")

R will ask you which CRAN repository to use. Choose one in California.


Outputting Tables as LaTeX

> library(xtable)

Let’s look at the help index for the xtable package first

> help(package="xtable")

A boring table of counts in R…

> table(groundcover$type)

plant  rock  sand
   36     9    12

Outputting Tables as LaTeX

> xtable(table(groundcover$type))
% latex table generated in R 2.5.1 by xtable 1.5-1 package
% Sun Sep 23 13:26:30 2007
\begin{table}[ht]
\begin{center}
\begin{tabular}{rr}
  \hline
 & V1 \\
  \hline
plant & 36 \\
  rock & 9 \\
  sand & 12 \\
   \hline
\end{tabular}
\end{center}
\end{table}

Department and External R Resources

http://info.stat.ucla.edu/grad

Rseek.org

http://www.rseek.org

A Google search engine for R help. It is AWESOME. Helped me find the answers to all of my questions.


Future Department Resources

An R forum is in the works where R users from all over the world can help each other with their questions.

No ETA, but soon.

We are hoping graduate students will help answer questions, as your support is pivotal.

Interfacing with Other Software

R can interface with several other languages and packages.

• GRASS GIS
• ArcGIS
• MySQL
• MATLAB
• Python
• C
• FORTRAN
• Etc, etc, etc.

Statistical Computing at UCLA

Statistics 202A, “Data Mining and Statistical Programming,” Mark Hansen, TA Ryan Rosario

Statistics 202B, “Numerical Linear Algebra and Random Numbers” actually “Intro to Multivariate Analysis and Matrix Optimization,” Jan de Leeuw

Statistics 202C, “Markov Chain Monte Carlo Methods,” Chiara Sabatti, + TA

Just the Beginning!

ON TO 202A!

R Questions? Computing Questions?

General [email protected]
