Intro to R Programming Language Handout


Upload: nick-ashley

Post on 09-Apr-2016



Page 1: Intro to  R programming language Handout

Introduction to R and RStudio

We will be using RStudio in this course to carry out statistical procedures. RStudio uses the coding language of R. Thus, in order to use RStudio, you must first download R. The same code can be run in R and RStudio, but we will use RStudio because it tends to be easier to work with and has a nicer display than R.

Lawrence University RStudio Server

The first option for using RStudio is to use the Lawrence University RStudio server. This can be accessed at https://rstudio.lawrence.edu. Use your Lawrence email username and password to log in. This site can only be accessed on the Lawrence campus.

Downloading R and RStudio

The second option for using RStudio is to download it. This will allow you to use RStudio when you are not on campus and when you are no longer at Lawrence.

The first step in the process is to download R. Follow these instructions to do so.

1. Go to https://www.r-project.org.

2. Click on the link that says download R. This will take you to a page of CRAN Mirrors from around the world.

3. Scroll down until you reach the USA mirrors.

4. Click on any of the options under USA, such as http://cran.cnr.Berkeley.edu/. This will direct you to another page, where you will have the option of downloading R for Linux, Mac, or Windows.

5. Select the appropriate link based on the type of operating system your computer uses and follow the instructions for downloading the latest version of R. (The most current version is R 3.2.1 "World-Famous Astronaut" released on 2015/06/18.)

To check that you have correctly downloaded R, open R, and it should look like one of the following displays based on your operating system (Linux will appear similar to Windows).

Next you will download RStudio. Follow these instructions to do so.

1. Go to https://www.rstudio.com.

2. Place your cursor over products and click on RStudio. This will take you to a page with the options of either selecting desktop or server.


3. Click on desktop.

4. On this page, under the open source edition, click on Download RStudio Desktop. This will take you to a page with various platform installers.

5. Select the appropriate one for your operating system, and begin the installation process.

To check that you have correctly downloaded RStudio, open RStudio, and it should look like the following (the display will be similar for all three operating systems).

Becoming Familiar with RStudio

The display of RStudio is broken into four sections as seen in the image below.

• On the top left is an rscript. This is where you will type your code and is a document that can be saved when you are done working, so you can return to it at a later time.

• Below the rscript, in the bottom left corner, is the console. This is where the code is run and where results will be produced.

• To the right of the console, in the bottom right corner, is the space where figures will be displayed when you create them. It is also where files can be accessed, help information is displayed, and packages can be loaded (but we will not discuss packages now and maybe not at all in this course).

• The top right area displays your environment, which includes current data sets you have loaded, variables you have created, and so forth. (Since you have not worked with RStudio yet, there should be nothing in your environment.)


Using RStudio as a Calculator and Working with an Rscript

Opening a New Rscript

If there is not an rscript already open, click on the sheet with a white plus sign inside a green circle at the top left of RStudio, and select R Script from the dropdown menu. This will provide you with an untitled blank rscript.

Some Simple Computations

We will first use RStudio as a calculator. In the blank rscript, type the following:

2+2

In order to perform the computation, place your cursor on the same line as 2+2. Depending on whether you are using a Mac or PC, use the following command.

Operating System   Command
Mac                command+return
PC                 control+enter

(With any operating system, you can also copy and paste the code into the console.) This will send 2+2 to the console and tell it to run. The answer will be returned in the console.

Try using RStudio to compute the following expressions.

2-2

100/10

sqrt(4)

(200-5)*4+sin(2)

The output should appear as follows.


> 2-2

[1] 0

> 100/10

[1] 10

> sqrt(4)

[1] 2

> (200-5)*4+sin(2)

[1] 780.9093

Saving an Rscript

Suppose that you are interested in saving the rscript after typing these expressions. In order to do this, hit the save button that is at the top of the rscript (not the one on the toolbar under the dropdown folder of code).

When you close out of RStudio, it will ask if you want to save your workspace. For our purposes in this course, you should always say no. Instead, make sure to save the code you have typed in an rscript.

Variables and Vectors

When working with data, we find ourselves working with variables that have a list of values that have been collected in the sample. One way to work with these variables in R is to create a vector. In R, a vector is an object with a list of values assigned to it. There is a lot that can be done with vectors, but the following section will give a brief introduction to them.

We can make vectors by using <- to assign values to a name. Suppose that we want to create a vector called yellow with only the value 1 assigned to it. We can do this by typing the following code in the rscript and running it in the console.

yellow <- 1

Now, when I type yellow in the console and hit enter, it will output the number 1. (Note that we can leave out the spaces, and the command will work in the same way: yellow<-1)

We can create vectors with more than one value by using the command c(), which stands for concatenate, to create the list of values. For example, if we want to create a vector called red with the values 2, 3, 4, and 5 in it, we could do so with the following code.


red <- c(2,3,4,5)

Then, when red is run in the console, the list of numbers assigned to it will be returned.

We can give our vectors whatever name we like, but we cannot include spaces in their name. For example, the following code will produce an error.

a vector <- c(1,2,3,4)

Instead, we need to remove the space, or we could put a period in place of the space.

avector <- c(1,2,3,4)

a.vector <- c(1,2,3,4)

It is also important to know that capitalization is recognized by R. We will find that the vector red that we created before is not the same as Red.
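The naming rules above can be checked directly in the console. The following is only an illustrative sketch: it uses exists(), parse(), and try(), which are base R functions not otherwise covered in this handout, to show that Red is not defined and that a name containing a space is rejected without stopping the script.

```r
# A vector with a valid name
red <- c(2,3,4,5)

# Names are case sensitive: red exists, but Red does not
exists("red")
exists("Red")

# A name containing a space cannot even be read by R.
# parse() reads the code without running it, and try() catches the error.
bad_code <- "a vector <- c(1,2,3,4)"
result <- try(parse(text = bad_code), silent = TRUE)
name_rejected <- inherits(result, "try-error")
name_rejected
```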

We can perform mathematical operations with vectors of the same length. For example, if we create a new variable called blue with four values in it, we can add, subtract, multiply, and divide red with blue using the following code.

blue <- c(1,2,3,4)

red+blue

red-blue

red*blue

red/blue


When the code is run in the console, it appears as follows.
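For reference, here is a sketch of the same four operations with the values R returns written as comments. Each operation is carried out position by position, so the first value of red is combined with the first value of blue, the second with the second, and so on.

```r
red <- c(2,3,4,5)
blue <- c(1,2,3,4)

red+blue   # 3 5 7 9
red-blue   # 1 1 1 1
red*blue   # 2 6 12 20
red/blue   # 2.000000 1.500000 1.333333 1.250000
```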

Performing a mathematical operation on vectors creates a new vector, and we can assign a name to it. For example,

purple <- red+blue

We can ask RStudio to tell us the value in a certain position of a vector. For example, if we are interested in determining the value in the third place of the vector purple, we can do so using brackets.

purple[3]
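As a small sketch of how brackets behave, the code below rebuilds purple and pulls out values by position. The last two lines go slightly beyond this handout: a vector of positions inside the brackets returns several values at once, and the base R function length() reports how many values a vector holds.

```r
red <- c(2,3,4,5)
blue <- c(1,2,3,4)
purple <- red+blue

purple[3]        # third value: 7
purple[c(1,4)]   # first and fourth values: 3 9
length(purple)   # number of values: 4
```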

We can also create vectors containing non-numerical values if we are working with a categorical variable. In order for RStudio to recognize that the values in the vector are categories instead of numbers, we need to put quotation marks around each value. For example, we may encounter a variable of yes and no responses. We can create such a vector using the following code.

responses <- c("yes","no","no","yes")


Try creating vectors of the variables measurement and sex from the example dataset in class.

Subject   Treatment   Measurement   Sex   Age
1         drug        5             M     21
2         drug        8             M     25
3         drug        6             M     25
4         drug        14            F     21
5         drug        7             F     20
6         drug        6             M     20
7         placebo     6             F     30
8         placebo     2             F     28
9         placebo     5             M     22
10        placebo     3             M     22
11        placebo     3             M     24

The code and output should look as follows.

> measurement <- c(5,8,6,14,7,6,6,2,5,3,3)

> measurement

[1] 5 8 6 14 7 6 6 2 5 3 3

> sex <- c("M","M","M","F","F","M","F","F","M","M","M")

> sex

[1] "M" "M" "M" "F" "F" "M" "F" "F" "M" "M" "M"

Creating a dataset in RStudio

If we are working with a small dataset, it can be useful to simply type the data into RStudio. We just saw how to create a single vector in RStudio, but in order for RStudio to recognize several vectors as a dataset, we use the command data.frame. For example, if we want RStudio to recognize the example dataset from class, we can use data.frame in the following way. Note that we use <- to assign a name to the dataset, but inside the command data.frame, we have to use = to create the variables.

Also, the way I have formatted the code causes it to take up multiple lines in an rscript. When transferring the code from the rscript to the console, send the first line to the console. When you do this, the symbol + appears at the command line instead of >. RStudio has recognized that the code entered is incomplete, and it is waiting for additional code before running the code entered. Thus, move the cursor to the next line of code in the rscript (it may do this automatically for you), and send this line of code to the console. Continue to do this until the whole chunk of code is in the console.

To check if the dataset has been entered correctly, type data into the console, and the data frame will print as seen in the image below.

data <- data.frame(treatment=c("drug","drug","drug","drug","drug","drug",

"placebo","placebo","placebo","placebo","placebo"),

measurement=c(5,8,6,14,7,6,6,2,5,3,3),

sex=c("M","M","M","F","F","M","F","F","M","M","M"),

age=c(21,25,25,21,20,20,30,28,22,22,24))


Uploading a .csv Dataset

We can also upload a dataset to RStudio that is stored on our computer in a different type of document such as a .txt file, an Excel document, or a .csv file. However, we are only going to work with .csv files for now.

I have uploaded a dataset called Cereals.csv to Moodle. Download the dataset, and save it to a folder on your computer where you can easily find it. (I would recommend creating a new folder for this course.)

If you are using the RStudio server, you get a small amount of storage space on the server, and you will need to upload the dataset to your storage. To do this, I recommend creating a new folder on the server. Click on the New Folder button shown in the picture below. I created the folder called data. Then go into your new folder, click on the upload button seen in the picture below, and upload Cereals.csv to your new folder. If you are not using the server, ignore this paragraph.

In order to import the data into R so we can use it, we need to set our working directory to the location that we have the data file saved in. To change your working directory, go to Session, then Set Working Directory, and select Choose Directory.

This will open a popup. Find the folder that you have the data saved in, highlight it, and hit open. Now RStudio will be able to access data files that you have saved in that folder.
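The working directory can also be set with code rather than through the menus, using the base R functions getwd() and setwd(). The sketch below uses tempdir() as a stand-in folder so that it runs on any computer. In practice you would replace it with the path to your own course folder, which is an assumption this handout cannot fill in for you.

```r
# Remember the current working directory so it can be restored later
old_dir <- getwd()

# Stand-in for your course folder (replace with your own path)
course_dir <- tempdir()

setwd(course_dir)   # change the working directory
getwd()             # confirm the change
list.files()        # list the files R can now find by name alone

setwd(old_dir)      # return to the original working directory
```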


Next we can use the command read.csv() to load your data into R. To do this, type the name of the .csv file inside the parentheses of read.csv() with quotation marks around the name. If the .csv file has the variable columns labelled (as the document Cereals.csv does as shown in the image below), put a comma after the file name and include header=TRUE. This tells RStudio to treat the first row of entries in the .csv document as the variable names and the values in the second row as the start of the variable entries. If the columns are not labelled, then include header=FALSE. Then RStudio will treat the entries in the first row as the start of the variable entries.

It is important that you name the data so you can work with it easily after it has been loaded into RStudio. Again, we will use <- to name the data.

Thus, to upload the file Cereals.csv to RStudio and name it, use the following command. I have chosen to call the data cereals.

cereals <- read.csv("Cereals.csv",header=TRUE)

If the file has uploaded correctly, you can type cereals into the console, and it will print the data.
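To see what the header argument changes, here is a minimal sketch that writes a tiny made-up .csv file to a temporary location and reads it back both ways. The file contents and the names csv_path, with_header, no_header, and as_factors are hypothetical and only stand in for a real file such as Cereals.csv.

```r
# Hypothetical three-row file with a row of column labels at the top
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Age,Shelf,Sodiumgram",
             "adult,bottom,0.007",
             "children,middle,0.005",
             "children,top,0.002"), csv_path)

# First row treated as variable names
with_header <- read.csv(csv_path, header=TRUE)
names(with_header)   # "Age" "Shelf" "Sodiumgram"
nrow(with_header)    # 3

# First row treated as data, so R makes up the names V1, V2, V3
no_header <- read.csv(csv_path, header=FALSE)
names(no_header)     # "V1" "V2" "V3"
nrow(no_header)      # 4

# Ask for text columns to be stored as factors (categories)
as_factors <- read.csv(csv_path, header=TRUE, stringsAsFactors=TRUE)
is.factor(as_factors$Age)   # TRUE
```

The output shown in this handout comes from R 3.2.1, where read.csv() converted text columns to factors by default. Since R 4.0.0, text columns are read as character values unless stringsAsFactors=TRUE is included, so output such as str() and summary() may look somewhat different in a newer version of R.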


Working with a Dataset and Some Simple Computations

Exploring the Uploaded Dataset

Once your data is uploaded to RStudio, it is always a good idea to get a general understanding of the dataset. The following commands are helpful.

Command     Description
str()       Displays the general structure of the data
summary()   Gives some summary statistics for each of the variables in the data
head()      Displays the top several rows of the data
names()     Lists the names of the variables in the dataset

For example, the command

summary(cereals)

produces the following output.

> summary(cereals)
       ID             Age         Shelf       Sodiumgram         Proteingram
 Min.   : 1.0   adult   :17   bottom: 9   Min.   :0.000000   Min.   :0.03030
 1st Qu.:11.5   children:26   middle:19   1st Qu.:0.003150   1st Qu.:0.03391
 Median :22.0                 top   :15   Median :0.004839   Median :0.06667
 Mean   :22.0                             Mean   :0.004634   Mean   :0.08162
 3rd Qu.:32.5                             3rd Qu.:0.006481   3rd Qu.:0.09717
 Max.   :43.0                             Max.   :0.007407   Max.   :0.26667
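If you do not have Cereals.csv at hand, the same exploring commands can be tried on the example dataset from class. The sketch below re-enters that dataset with data.frame and applies each command. The comments describe what each call displays.

```r
# The example dataset from class, entered by hand
data <- data.frame(treatment=c("drug","drug","drug","drug","drug","drug",
                               "placebo","placebo","placebo","placebo","placebo"),
                   measurement=c(5,8,6,14,7,6,6,2,5,3,3),
                   sex=c("M","M","M","F","F","M","F","F","M","M","M"),
                   age=c(21,25,25,21,20,20,30,28,22,22,24))

str(data)         # structure: 11 observations of 4 variables
summary(data)     # summary statistics for each variable
head(data)        # the first six rows
head(data, 3)     # only the first three rows
names(data)       # "treatment" "measurement" "sex" "age"
```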

Accessing Variables in the Dataset

As mentioned above, the command str() displays the general structure of the data. For example, we obtain the following output if we enter str(cereals) into the console.

'data.frame': 43 obs. of 5 variables:

$ ID : int 1 2 3 4 5 6 7 8 9 10 ...

$ Age : Factor w/ 2 levels "adult","children": 1 2 2 2 1 2 2 2 2 2 ...

$ Shelf : Factor w/ 3 levels "bottom","middle",..: 1 1 1 1 1 1 1 1 1 2 ...

$ Sodiumgram : num 0.007 0.00667 0.00467 0.00697 0.007 ...

$ Proteingram: num 0.1 0.0667 0.0333 0.0303 0.1 ...

It lists the variables in the dataset and what type of variable they are. Before each variable is a $, which is known as the extract operator. Thus, in order to access a variable in the dataset, we must first type the name of the dataset, a $, and then the name of the variable in the dataset. For example, if we want to print just the variable Sodiumgram from the dataset cereals, we would type:

cereals$Sodiumgram

The output from this command is as follows. (Note that cereals$Sodiumgram is a vector.)

> cereals$Sodiumgram

[1] 0.007000000 0.006666667 0.004666667 0.006969697 0.007000000 0.006000000 0.006129032

[8] 0.004838710 0.001851852 0.005517241 0.006666667 0.004500000 0.004375000 0.007096774

[15] 0.007000000 0.006785714 0.004545455 0.005000000 0.004687500 0.003833333 0.004500000

[22] 0.006666667 0.006296296 0.007407407 0.004375000 0.005333333 0.005666667 0.004848485

[29] 0.002200000 0.007000000 0.003500000 0.001792453 0.004500000 0.002800000 0.000222222

[36] 0.001634615 0.002800000 0.005818182 0.002727273 0.005600000 0.000000000 0.000000000

[43] 0.002452830
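The same extraction works on any data frame. As a brief sketch, the code below uses a small data frame named small, a hypothetical name holding the first four subjects of the class example, to show that $ returns an ordinary vector that can be indexed or passed to other functions.

```r
# First four subjects of the class example (the name small is hypothetical)
small <- data.frame(measurement=c(5,8,6,14),
                    sex=c("M","M","M","F"))

small$measurement              # 5 8 6 14
is.vector(small$measurement)   # TRUE
small$measurement[4]           # fourth value: 14
sum(small$measurement)         # 33
```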


If you want to work with a variable without typing as much, you can assign it to a shorter name. For example:

Sodiumgram <- cereals$Sodiumgram

Now, just by typing Sodiumgram, it will print the same output as cereals$Sodiumgram does.

Performing Commands on the Uploaded Dataset

We can use RStudio to easily perform many of the computations we discussed in class on quantitative variables in a dataset (and vectors in general). The table below contains the functions for the computations.

Computation Function

mean mean()

median median()

standard deviation sd()

variance var()

quantiles quantile()

range range()

minimum min()

maximum max()

five number summary summary()

In order to use the functions on a variable, type the name of the variable inside the parentheses. For example, if I want to compute the mean of Sodiumgram, I will type

mean(cereals$Sodiumgram)

or

mean(Sodiumgram)

since I renamed the variable.
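As a sketch that can be run without Cereals.csv, the functions in the table are applied below to the measurement variable from the class example. Only results that are easy to check by hand are written as comments. Note that summary() reports the mean in addition to the five number summary.

```r
measurement <- c(5,8,6,14,7,6,6,2,5,3,3)

mean(measurement)       # 65/11, which is about 5.909
median(measurement)     # 6
sd(measurement)         # standard deviation
var(measurement)        # variance, the square of the standard deviation
quantile(measurement)   # minimum, quartiles, and maximum
range(measurement)      # 2 14
min(measurement)        # 2
max(measurement)        # 14
summary(measurement)    # five number summary plus the mean
```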

Tables

You can create tables in RStudio with categorical variables using the command table(). For example, to create a frequency table of the different levels of the variable Age in the dataset cereals, use the command

table(cereals$Age)

This produces the following output. We see that 17 of the cereals are considered adult cereals, and 26 are considered children's cereals.

adult children

17 26

To create a two-way table, include the two variables of interest inside the table() command with a comma between the variables. For example,

table(cereals$Age,cereals$Shelf)

creates a two-way frequency table of the variables age and shelf.

bottom middle top

adult 2 1 14

children 7 18 1


If you are interested in creating a relative frequency table with proportions (instead of just the frequencies), first create a table of the desired variable(s), and give it a name. Then divide the table by the summation of the entries in the table. For example, to create a relative frequency table of the variable Age, use the following command.

tab <- table(cereals$Age)

tab/sum(tab)

It produces this output.

adult children

0.3953488 0.6046512

Similarly, we can also use the same procedure for the two-way table.

tab2 <- table(cereals$Age,cereals$Shelf)

tab2/sum(tab2)

The output is as follows.

bottom middle top

adult 0.04651163 0.02325581 0.32558140

children 0.16279070 0.41860465 0.02325581
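Base R also provides prop.table(), which is not used elsewhere in this handout, as a shortcut for dividing a table by its total. The sketch below uses the treatment and sex variables from the class example so that it runs without Cereals.csv, and shows that both approaches give the same proportions.

```r
treatment <- c("drug","drug","drug","drug","drug","drug",
               "placebo","placebo","placebo","placebo","placebo")
sex <- c("M","M","M","F","F","M","F","F","M","M","M")

tab2 <- table(treatment, sex)
tab2                # two-way table of counts
tab2/sum(tab2)      # proportions, computed as described above
prop.table(tab2)    # the same proportions from the built-in shortcut
```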

Graphing

There are three basic types of plotting commands in RStudio: high-level plot functions, low-level plot functions, and the layout command par (which will not be discussed in this handout). Basically, a high-level plot function creates a complete plot, and a low-level plot function adds to an existing plot, that is, one created by a high-level plot command.

High-Level Plot Functions

There are many plots that can be created in RStudio, but for now, we will just focus on three: barplots, histograms, and boxplots. The plot functions for these are included in the table below.

Graph Command

barplot barplot()

histogram hist()

boxplot boxplot()

Suppose we want to create a barplot of the age levels in the cereal dataset. We can do this using the following command.

barplot(table(cereals$Age))

Note that you must use the table() command inside the command barplot() when creating a barplot if the categorical variable is a vector, as cereals$Age is. The plot will appear in the lower right-hand corner of RStudio as follows.


If we want to make a histogram and boxplot of the variable Sodiumgram, we can apply the commands to the variable, and the graphs below are created.

hist(cereals$Sodiumgram)

boxplot(cereals$Sodiumgram)

[Figures: a histogram titled "Histogram of cereals$Sodiumgram" (x-axis: cereals$Sodiumgram, from 0.000 to 0.008; y-axis: Frequency) and a boxplot of cereals$Sodiumgram.]

These plots are fairly basic (and are lacking good titles and labels). We can add additional arguments to the code to enhance the plots. The table below includes some options.

Option   Description
col      color: col="red", col="blue", ...
xlim     x-axis limits: xlim=c(min,max)
ylim     y-axis limits
xlab     x-axis label: xlab="my label"
ylab     y-axis label
main     main title
sub      subtitle

For example, we can improve the boxplot by adjusting the code as follows. Note that I have included an additional option specific to boxplots: horizontal=TRUE. This causes the boxplot to be horizontal instead of the default of vertical.

boxplot(cereals$Sodiumgram,col="grey",ylim=c(-0.0005,0.009),xlab="Sodium (grams)",

main="Boxplot of Amount of Sodium in Sampled Cereals", horizontal=TRUE)

[Figure: horizontal boxplot titled "Boxplot of Amount of Sodium in Sampled Cereals"; x-axis: Sodium (grams), from 0.000 to 0.008.]
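The same options apply to the other high-level plot functions. As a sketch that runs without Cereals.csv, the code below draws a labelled barplot and histogram from the class example variables. The titles and axis labels are made up for illustration, and the names bar_positions and hist_info are hypothetical. Both functions return information about the plot, which is stored here but is not needed to see the figure.

```r
measurement <- c(5,8,6,14,7,6,6,2,5,3,3)
sex <- c("M","M","M","F","F","M","F","F","M","M","M")

# Barplot of a categorical variable: tabulate first, then plot
bar_positions <- barplot(table(sex), col="grey", xlab="Sex", ylab="Frequency",
                         main="Number of Subjects by Sex")

# Histogram of a quantitative variable with a title, label, and x-axis limits
hist_info <- hist(measurement, col="grey", xlim=c(0,15),
                  xlab="Measurement", main="Histogram of Measurements")
```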

Additionally, we can easily create side-by-side boxplots in RStudio. For example, if we want to compare the sodium amounts of cereals by the age group the cereal is marketed to, we can use the following command. Note that we do this by placing a ~ after the quantitative variable and then listing the categorical variable of interest. The code also includes names=c("Adult","Children"), which allows us to adjust the category names.

boxplot(cereals$Sodiumgram~cereals$Age,col="grey",ylim=c(-0.0005,0.009),xlab="Sodium (grams)",

ylab="Age", main="Boxplots of Amount of Sodium in Sampled Cereals by Age",

names=c("Adult","Children"),horizontal=TRUE)


[Figure: side-by-side horizontal boxplots titled "Boxplots of Amount of Sodium in Sampled Cereals by Age"; x-axis: Sodium (grams), from 0.000 to 0.008; y-axis: Age, with categories Adult and Children.]
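The ~ notation can also be combined with the data argument of boxplot(), which is not used elsewhere in this handout and saves typing the dataset name before each variable. The sketch below applies it to the class example, comparing measurement across the two treatment groups. The labels and the name box_info are made up for illustration.

```r
data <- data.frame(treatment=c("drug","drug","drug","drug","drug","drug",
                               "placebo","placebo","placebo","placebo","placebo"),
                   measurement=c(5,8,6,14,7,6,6,2,5,3,3))

# measurement~treatment means "measurement split up by treatment"
box_info <- boxplot(measurement~treatment, data=data, col="grey",
                    xlab="Measurement", ylab="Treatment",
                    main="Boxplots of Measurement by Treatment",
                    names=c("Drug","Placebo"), horizontal=TRUE)

box_info$n   # number of observations in each group: 6 5
```

Writing boxplot(data$measurement~data$treatment, ...) in the style used above for cereals should produce the same boxes.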

Low-Level Plot Functions

Low-level plot functions can be executed only after a high-level plot has been created. There are lots of options for this, but for now, I will only include the command abline(), which adds a line to a plot after it has been created. This can be useful for displaying a statistic on the plot. For example, the following code places a dashed line on the histogram of cereal sodium amounts at the mean sodium amount. Note that the v causes the line at the mean to be vertical. If you require a horizontal line, use h instead. Also, lty indicates the type of line to be used. Try using other numbers to view additional line types.

hist(cereals$Sodiumgram,xlab="Sodium (grams)",

main="Histogram of Amount of Sodium in Sampled Cereals",

sub="(dashed line represents mean amount of sodium)")

abline(v=mean(cereals$Sodiumgram),lty=2)

[Figure: histogram titled "Histogram of Amount of Sodium in Sampled Cereals" with subtitle "(dashed line represents mean amount of sodium)"; x-axis: Sodium (grams), from 0.000 to 0.008; y-axis: Frequency.]

Other options for the line are listed in the table below.

Option   Description
lty      line type: lty=1, lty=2, ...
lwd      line thickness: lwd=1, lwd=2, ...
col      color: col="red", col="blue", ...
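As a sketch of how these line options combine, the code below draws a histogram of the measurement variable from the class example and then adds three lines to it. The particular line types, thicknesses, and colors are arbitrary choices for illustration.

```r
measurement <- c(5,8,6,14,7,6,6,2,5,3,3)

# High-level plot first
hist(measurement, xlab="Measurement", main="Histogram of Measurements")

# Dashed, thicker red vertical line at the mean
abline(v=mean(measurement), lty=2, lwd=2, col="red")

# Dotted blue vertical line at the median
abline(v=median(measurement), lty=3, col="blue")

# Solid horizontal line at a frequency of 2
abline(h=2, lty=1)
```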

Saving Plots

There are several ways to save a plot in RStudio. I will only discuss one way. Once the plot has been created, click on the export button above the plot, and select either Save as Image... or Save as PDF... depending on which you would prefer. A window will appear that will allow you to name the plot, choose where you want to store it, and save it.
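For readers who prefer to save a plot with code, one of the other ways is sketched here using the base R functions pdf() and dev.off(). The file name measurement_histogram.pdf and the use of tempdir() as the destination are assumptions for illustration. In practice you would choose your own file name and folder.

```r
measurement <- c(5,8,6,14,7,6,6,2,5,3,3)

# Hypothetical destination for the saved plot
plot_path <- file.path(tempdir(), "measurement_histogram.pdf")

pdf(plot_path)   # open a PDF file to draw into
hist(measurement, xlab="Measurement", main="Histogram of Measurements")
dev.off()        # close the file so the plot is written to disk

file.exists(plot_path)   # check that the file was created
```

While the PDF file is open, the plot is drawn into the file and does not appear in the lower right corner of RStudio.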


General Tips for Using RStudio

• There are often many ways to perform the same action in RStudio. If you find a way that you prefer, feel free to use it.

• When you are first learning RStudio (and even when you have worked with it for a long time), errors often are the result of a missing comma, a misspelled word, the wrong capitalization, etc. Be precise and patient when typing your code.

• Type colors() in the console to see the list of colors available for the plotting commands.

• If you place your cursor at the command line in the console and hit the up arrow, previously entered code will appear, which can be helpful.

• If you are ever looking for help with a command in R, type the command with a question mark before it into the console. For example, ?mean. Then information will appear in the bottom right corner of the window, where help information is displayed. The internet is also a great resource for help with R and RStudio.

• If you type # at the beginning of a line of code, RStudio will treat the typing after # as a comment. It can be a good idea to include comments in rscripts as a title, to explain what is happening in the code, etc.
