Using R Basics Data Manipulation
Introduction to R
Peter Dalgaard
Department of Biostatistics, University of Copenhagen
Statistical Practice in Epidemiology, Tartu 2006
Outline
Using R
Basics
Data Manipulation
What is R?
- R is an “environment for statistical computing and graphics”
  - Highly flexible graphics routines
  - Statistical functions (standard tests, modelling)
  - Controlled by a programming language
- In this course we use R exclusively
- The first practical is a workbook exercise designed to help you get started with R
- This lecture is intended to give you the broader picture
Basics of R
- What is R?
- Interacting with R
- Extended user interfaces
- Later: Dealing with R’s workspace
Key Points about R
- Environment built around the programming language R (an Open Source dialect of the S language).
- R is Free Software, and runs on a variety of platforms (I’ll be using Linux. Computer labs run on Windows.)
- Command-line execution based on function calls
- Extensible with user functions
- Workspace containing data and functions
- Graphics devices
Interacting with R
- Command line interface (CLI)
- The basic mode of interaction is “read - evaluate - print”
  - User types an expression at the command line,
  - R evaluates it
  - . . . and prints the result
- Batch variation: read commands from a file
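The batch variation can be tried from within R itself. This is a minimal sketch, assuming a throwaway script written to a temporary file; the file contents and the names a and b are made up for illustration.

```r
# Write two commands to a temporary script file (hypothetical contents).
script <- tempfile(fileext = ".R")
writeLines(c("a <- 2 + 2", "b <- log(10)"), script)

# Read and evaluate the commands from the file instead of typing them.
source(script)

# Typing a name at the prompt prints the object.
a
b
```

By default source() evaluates the commands in the workspace, so a and b are available afterwards just as if they had been typed at the prompt.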
Extended Interfaces
- Windows, Macintosh GUI: Fairly simple extensions of the CLI; they mostly offload some tasks to a menu interface and add command recall
- Script editing: The ability to work with multiple lines of R code, save them to a file for later use, etc. A simple script editor is built into the R GUI in recent versions.
- External editor interfaces: TINN-R and R-WinEdt add syntax highlighting. Highly recommended.
- R embedded in a text editor (ESS, Emacs Speaks Statistics). Popular on Unix/Linux systems.
Demo 1
2+2
log(10)
help(log)
summary(airquality)
demo(graphics) # pretty pictures...
R packages
- An important new thing in R has been its handling of add-on packages
  - Standard format
  - Easy end-user handling
  - Quality control system (portability, version dependency)
- CRAN: Comprehensive R Archive Network, modelled on CTAN (TeX) and CPAN (Perl). Kurt Hornik, Fritz Leisch, TU-Vienna
- Currently over 500 packages on CRAN.
Language
- R is a programming language, also on the command line
- Basic structure: Functions acting on objects
- (Functions are also a kind of object, operators a kind of function)
- Print an object by typing its name
- Evaluate an expression by entering it on the command line
- Call a function, giving the arguments in parentheses, possibly empty
- Notice ls vs. ls()
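A short sketch of these points, using a made-up vector x and a made-up name f for a copied function:

```r
x <- c(3, 1, 2)

# Operators are functions: the two calls below are equivalent.
x + 1
`+`(x, 1)

# Functions are objects: they can be assigned to a new name and passed around.
f <- sum
f(x)
sapply(list(min, max), function(g) g(x))

# ls() calls the function and lists the workspace;
# ls without parentheses would print the function object itself.
is.function(ls)
ls()
```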
Objects
- The basic object type is the vector
- Modes: numeric, integer, character, generic (list)
- Operations are vectorized: you can add entire vectors with a + b
- Recycling of objects: If the lengths don’t match, the shorter vector is reused
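Recycling is easiest to see with two vectors of clearly different lengths. The values below are made up for illustration; the warning in the last case is suppressed only so that the sketch runs quietly.

```r
a <- 1:6
b <- c(10, 20)

# b is shorter, so it is reused: 10, 20, 10, 20, 10, 20
a + b

# A single number is a vector of length 1 and is recycled the same way.
a * 2

# If the longer length is not a multiple of the shorter one,
# R still recycles, but normally issues a warning.
r <- suppressWarnings(1:5 + c(10, 20))
r
```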
Demo 2
x <- round(rnorm(10,mean=20,sd=5)) # simulate data
x
mean(x)
m <- mean(x)
x - m # notice recycling
(x - m)^2
sum((x - m)^2)
sqrt(sum((x - m)^2)/9)
sd(x)
Smart indexing
- R has several unusual but highly useful indexing mechanisms:
  - a[5] single element
  - a[5:7] several elements
  - a[-6] all except the 6th
  - a[b>200] logical index
  - a["name"] by name
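The indexing forms can be tried on a small named vector. The values and names below are made up, and b is simply a copy of a so that the logical condition has something to test.

```r
a <- c(first = 120, second = 250, third = 90, fourth = 310,
       fifth = 205, sixth = 40, seventh = 500)
b <- a   # the logical condition is based on the same values here

a[5]          # single element
a[5:7]        # several elements
a[-6]         # all except the 6th
a[b > 200]    # elements for which the condition is TRUE
a["third"]    # by name
```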
Lists
- Lists are vectors where the elements can have different types
- Functions often return lists
- lst <- list(A=rnorm(5), B="hello")
- Special indexing:
  - lst$A
  - lst[[1]] first element
  - lst[1] list containing the first element
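The difference between the indexing forms shows up in the class of the result. The sketch below reuses lst from above; the t.test call is just one convenient example of a function that returns a list.

```r
lst <- list(A = rnorm(5), B = "hello")

lst$A              # the component named A (a numeric vector)
lst[[1]]           # the same component, extracted by position
lst[1]             # a list of length 1 that contains A

class(lst[[1]])
class(lst[1])

# Functions often return lists; the result of t.test is one example.
tt <- t.test(lst$A)
names(tt)
tt$p.value
```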
Functions
- logit <- function(p) log(p/(1-p))
- logit(0.5)
- Formal arguments
- Actual arguments
- Positional matching: plot(x,y)
- Keyword matching: t.test(x ~ g, mu=2, alternative="less")
- Partial matching: t.test(x ~ g, mu=2, alt="l")
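To see that keyword and partial matching reach the same argument, the two t.test calls can be compared directly. The data x and g below are simulated purely for illustration.

```r
logit <- function(p) log(p / (1 - p))
logit(0.5)    # p is the formal argument, 0.5 the actual argument

# Hypothetical example data: ten values in each of two groups.
set.seed(1)
x <- c(rnorm(10, mean = 2), rnorm(10, mean = 3))
g <- factor(rep(c("a", "b"), each = 10))

# Keyword matching with the full argument name ...
full <- t.test(x ~ g, mu = 2, alternative = "less")
# ... and partial matching of both the argument name and its value.
part <- t.test(x ~ g, mu = 2, alt = "l")

identical(full$p.value, part$p.value)
```

Partial matching saves typing at the command line; in scripts the full names are usually easier to read.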
Compound objects
- Attributes (dimensions, dimnames)
- Allows you to define complex data structures
- Matrices, arrays, tables
- Factors (categorical variables)
- Data frames
- Return values from tests, model fits
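A brief sketch of how attributes turn plain vectors into compound objects; all object names and values are made up.

```r
v <- 1:6
dim(v) <- c(2, 3)          # a dim attribute turns the vector into a matrix
dimnames(v) <- list(c("r1", "r2"), c("a", "b", "c"))
v
attributes(v)

f <- factor(c("low", "high", "low"))
attributes(f)              # levels and class are stored as attributes
unclass(f)                 # the underlying integer codes

d <- data.frame(id = 1:3, grp = f)
attributes(d)              # names, class and row names
```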
Classes, generic functions
- R objects have classes (there are two different class systems, but ignore that for now)
- Functions can behave differently depending on the class of an object
- E.g. summary(x) or print(x) does different things if x is numeric, a factor, or a linear model fit
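The same call to summary gives three different kinds of output for the three made-up objects below.

```r
num <- c(2.5, 3.1, 4.8, 5.0)
fac <- factor(c("a", "b", "a", "b"))
fit <- lm(num ~ fac)

class(num)
class(fac)
class(fit)

summary(num)   # quantiles and mean
summary(fac)   # counts per level
summary(fit)   # coefficient table, residual standard error, ...
```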
Data Manipulation Functions
- Constructors of simple objects
- Single-column modifications
- Modifying and subsetting data frames
Constructors
- R deals with many kinds of objects besides data sets
- Need to have ways of constructing them from the command line
- We have (briefly) seen the c and list functions
- Notice the naming forms c(boys=1.2, girls=1.1)
- Extracting and setting names with names(x)
- For matrices and arrays, use the (surprise) matrix and array functions. data.frame for data frames.
- It is also fairly common to construct a matrix from its columns using cbind
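A small sketch of the array and data.frame constructors, with made-up values; rbind is shown next to cbind as its row-wise counterpart.

```r
# A 2 x 3 x 2 array, filled column by column.
arr <- array(1:12, dim = c(2, 3, 2))
dim(arr)

# A data frame built from named columns (made-up values).
d <- data.frame(sex = c("M", "F", "F"), height = c(180, 165, 170))
d
names(d)

# cbind and rbind combine columns or rows into a matrix.
m <- cbind(a = 1:2, b = 3:4)
rbind(m, total = colSums(m))
```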
Demo 3
x <- c(boys = 1.2, girls = 1.1)
x
names(x)
names(x) <- c("M", "F")
x
matrix(1:4,ncol=2)
cbind(x=0:3,"exp(x)"=exp(0:3))
The factor Function
- This is typically used when read.table gets it wrong
  - E.g. group codes read as numeric
  - Or read as factors, but with levels in the wrong order (e.g. c("rare", "medium", "well-done") sorted alphabetically.)
- Notice the slightly confusing use of levels and labels arguments.
  - levels are the value codes on input
  - labels are the value codes on output (and become the levels of the resulting factor)
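The roles of levels and labels can be checked on the rare/medium/well-done example. The vector steak and the one-letter labels are made up for illustration.

```r
# Made-up data: cooking preference read in as character strings.
steak <- c("medium", "rare", "well-done", "rare")

# Default: the levels are sorted alphabetically.
levels(factor(steak))

# levels= lists the input codes in the order we want.
f1 <- factor(steak, levels = c("rare", "medium", "well-done"))
levels(f1)

# labels= renames them on output; the labels become the new levels.
f2 <- factor(steak, levels = c("rare", "medium", "well-done"),
             labels = c("R", "M", "W"))
levels(f2)
as.character(f2)
```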
Demo 4
aq <- airquality
aq$Month <- factor(aq$Month, levels=5:9,
                   labels=month.name[5:9])
aq$Month
levels(aq$Month) <- month.abb[5:9]
aq$Month
The cut Function
- The cut function converts a numerical variable into groups according to a set of break points
- Notice that the number of breaks is one more than the number of intervals
- Notice also that the intervals are left-open, right-closed by default (right=FALSE changes that)
- . . . and that the lowest endpoint is not included by default (set include.lowest=TRUE if it bothers you)
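The effect of right and include.lowest is easiest to see on a handful of values that sit exactly on the break points. The vector x below is made up for that purpose.

```r
x <- c(10, 11, 12, 13.5, 16)   # made-up values
breaks <- seq(10, 16, 2)       # 4 break points, 3 intervals

# Default: (10,12], (12,14], (14,16]; the value 10 falls outside and becomes NA.
table(cut(x, breaks), useNA = "ifany")

# include.lowest=TRUE closes the first interval: [10,12]
table(cut(x, breaks, include.lowest = TRUE))

# right=FALSE gives [10,12), [12,14), [14,16); now 16 falls outside ...
table(cut(x, breaks, right = FALSE), useNA = "ifany")

# ... unless include.lowest=TRUE, which in this case closes the last interval: [14,16]
table(cut(x, breaks, right = FALSE, include.lowest = TRUE))
```

Note that with right=FALSE the include.lowest argument applies to the highest endpoint, despite its name.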
Demo 5
library(ISwR); data(juul)
age <- subset(juul, age >= 10 & age <= 16)$age
range(age)
agegr <- cut(age, seq(10,16,2), right=FALSE,
             include.lowest=TRUE)
length(age)
table(agegr)
agegr2 <- cut(age, seq(10,16,2), right=FALSE)
table(agegr2)
Working with Dates
- Dates are usually read as character or factor variables
- Use the as.Date function to convert them to objects of class "Date"
- If data are not in the default format (YYYY-MM-DD) you need to supply a format specification
> as.Date("11/3-1959",format="%d/%m-%Y")
[1] "1959-03-11"
- You can calculate differences between Date objects. The result is an object of class "difftime", with a unit of days. You need as.numeric to get the actual number.
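A minimal sketch of a date difference, reusing the date from the example above; the second date and the names d1, d2 and dd are made up.

```r
d1 <- as.Date("11/3-1959", format = "%d/%m-%Y")
d2 <- as.Date("1960-03-11")     # default format, no specification needed

dd <- d2 - d1
class(dd)          # "difftime"
dd                 # printed together with its unit
as.numeric(dd)     # the plain number of days
```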