r: a gentle introduction - data services · part iii: r & rstudio. r and rstudio •r is a...

56
R: A Gentle Introduction Vega Bharadwaj | George Mason University Data Services

Upload: others

Post on 31-May-2020

24 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

R: A Gentle Introduction

Vega Bharadwaj | George Mason University Data Services

Page 2: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Part I: Why R?

Page 3: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

What do YOU know about R and why do you want to learn it?

Page 4: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Reasons to use R

• Free and open-source

• User-created “packages” available to download allow for an endless number of things you can do in R

• Highly customizable graphing capabilities

• Lots of free documentation and tutorials available on the web

• Beware of bad documentation, too

• You don’t need to be a computer scientist to code in R

• Packages do a lot of the “programming” (applying fundamental CS concepts) for you

Page 5: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

R for historians

https://programminghistorian.org/lessons/data_wrangling_and_management_in_R

Page 6: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

R for scientists

https://rcompanion.org/rcompanion/d_10.html

Page 7: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

R for social scientists

http://personality-project.org/r/psych/HowTo/factor.pdf

Page 8: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

R for text mining

http://tidytextmining.com/tidytext.html

Page 9: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Part II: Why code?

Page 10: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

What is pointing and clicking?

• Clicking, dragging, and using buttons and specified text boxes in user-friendly applications to do things

Page 11: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Drawbacks of pointing and clicking

• Does not usually involve an automatic way to keep track of all of your steps

• The things you can do are limited by the number of buttons available

Page 12: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Example: Microsoft Word

• Suppose you had to edit a Microsoft Word document…• Move paragraph 1 between

paragraphs 3 and 4• Relabel all paragraphs in

chronological order• Move paragraph 2 to the end• Remove all paragraph labels• Create headlines by

boldfacing the first line of each paragraph and separating it from the rest of the paragraph with a single line break

• Indent each paragraph below headlines

• Change font to 12 pt. “Times New Roman”

Page 13: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Example: Microsoft Word, 2

• Troubles with this situation:

• Word’s capabilities are limited

• Lots of rearranging to do (human effort)

• No buttons available through Word that can automatically put paragraphs in the order you want

• No way to document all these steps while ensuring 100% accuracy

• Might be some ambiguity with English language

• What if you had to hand this task over to someone else?

Page 14: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

What is coding?

• The process of writing out a list of instructions for a computer to read, interpret, and do

Page 15: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Summary: pointing and clicking vs. coding

Pointing and clicking Coding

SPECIFICITY Limited by number of

buttons available

Do certain things that

can’t be done through

pointing and clicking

alone

REPRODUCIBILITY No innate way to keep

track of steps while

minimizing error

By nature of coding,

keep track of everything

you do and save these

steps for future use

Page 16: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Part III: R & RStudio

Page 17: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

R and RStudio

• R is a programming language for statisticians

• Uses code to allow you to efficiently reshape datasets, perform statistical tests, and create graphics

• RStudio is an integrated development environment (IDE) for R

• Translates some R commands into point-and-click features

• Provides a user-friendly visual interface in which to code

Page 18: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

R vs. RStudio

Page 19: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Open RStudio

Page 20: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Different ways to code

• Console

• Quickly enter temporary commands

• Script file

• A text document in which you save blocks of code you will want to recreate later

Page 21: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Exercise: R as a calculator

• Type “5-3” into the console and hit the “Enter” key

• Things to note:

• “>” indicates where you should enter your input—never type this in yourself!

• “[1]” indicates the output’s first line

Page 22: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Exercise: R as a calculator, 2

• Type “5-” into the console and hit the “Enter” key

• Observe what happens

• Type “3” and hit the “Enter” key

• “+” indicates R is expecting more input

Page 23: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Exercise: R as a calculator, 3.1

• Create a new R script

Page 24: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Script files (.R)

• Script files are how you save the R code you want to recreate later

• Every line that begins with “#” is a comment, not interpreted by R as code

Page 25: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Exercise: R as a calculator, 3.2

• Type “5-3” into your R script, followed by “5+3”, separated by a line break exactly as it looks like below

Page 26: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Exercise: R as a calculator, 3.3

• Highlight the first line only and click the “Run” button

Page 27: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Exercise: R as a calculator, 3.4

• Examine the output in the “Console”

• Do the same for “5+3”

Page 28: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Concepts: the things you code

• Objects (NOUNS)

• The things you work with in R, i.e. datasets and statistical analysis information

• Functions (VERBS)

• The actions you perform in R, usually on objects

Page 29: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Functions in Excel

• Functions are indicated by their name, followed IMMEDIATELY by parentheses (see text in red)

• Arguments are references to objects (in this case, specific cells) or other types of descriptors that provide information to the function

=SUM(A1, A2)

Page 30: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Object creation

• Use “<-” to create new objects

• Type the object name into the console to get its value

Page 31: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Exercise: objects & functions

• Add the following to your R script and run each line:

Pay attention to spacing!

Page 32: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Exercise: objects & functions, 2

• Examine the output and the “Environment”

Page 33: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Other types of objects

Page 34: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Packages

• A set of functions and object templates available to download and use directly through RStudio

• Ordinarily, you can open them up using checkboxes, but we will do so using code

Page 35: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Summary: what you do in R

Using code…

1. Create objects (“nouns”)

2. Use functions (“verbs”) to do things to objects

Page 36: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Part IV: Working with Data in R

Page 37: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

What is a CSV file?

• CSV stands for “comma-separated values”

• Preferred format for working with data in R

• Can be opened in Excel

• Why CSV over Excel format?

• .XLSX files can cause problems in R

Page 38: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

CSV: Excel vs. text editor

Page 39: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Functions and datasets

• When working with datasets, it may be necessary to work with more complicated functions

• Arguments without an equals sign (“positional”) must always be in the same spot whenever the function is called

read.table(datafile, header=TRUE, sep=",")

Positional

Argument

Named

Argument

Named

Argument

Function

Page 40: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Other function examples

help()

library(ggplot2)

read.table(datafile, header=TRUE, sep=",")

read.table(datafile,header=TRUE,sep=",")

ggplot(mydata, aes(age, fare)) + geom_point(aes(color =factor(survived)))

Page 41: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Script and dataset management

• For every project, you must create a unique folder on your computer in which to store all your datasets (.CSV files) and scripts (.R files)

Page 42: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Set working directory

• Direct R to the right file folder

Page 43: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Save R script

NO NEED TO SPECIFY FILE EXTENSION!

Page 44: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Exercise: Load in data

Page 45: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Load in data, 2

Page 46: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Load in data, 3

Page 47: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Load in data, 4

Copy & paste into R script

Page 48: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Load in data, 5

• Highlight and run each line

• Examine output in the CONSOLE

Page 49: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Load in data, 6

• You can close out of dataset and click “titanic_r” in ENVIRONMENT to open it up again

Page 50: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Installing and loading packages

• To install a package, use the install.packages() function followed by the package name in double quotes

• To load a package, use the library() function followed by the package name without quotes

Page 51: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Using the ggplot2 package

• Copy and paste the following text into your script (after inserting some line breaks):

# install.packages(“ggplot2”)

library(ggplot2)

• REMEMBER: the “#” indicates a comment that is not interpreted by R as code

• We left this function as a comment because ggplot2is already installed

Page 52: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Exercise: Exploratory data analysis• Copy and paste the following lines of code into your

script (after inserting some line breaks):

head(titanic_r)

str(titanic_r)

summary(titanic_r$gender)

table(titanic_r$pclass, titanic_r$gender)

Page 53: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Exercise: Create graphs

• Copy and paste the following lines of code into your script (after inserting some line breaks):

qplot(pclass, fill=gender, data=mydata)

ggplot(titanic_r, aes(age, fare)) + geom_point(aes(color = factor(survived)))

Page 54: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Things to remember

• R is case-sensitive

• No spaces between function name and opening parenthesis

• Comment every block of code

• Leave line breaks after every code block

Page 55: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

For more coding practice:

http://infoguides.gmu.edu/learn_r/101

Page 56: R: A Gentle Introduction - Data Services · Part III: R & RStudio. R and RStudio •R is a programming language for statisticians •Uses code to allow you to efficiently reshape

Workshop resources:

https://dataservices.gmu.edu/workshops/r