20130215 reading data into r
Reading and Manipulating data in R
2013-02-15 @HSPH
Kazuki Yoshida, M.D. MPH-CLE student
Reading data in
- Usually the first task in real-life data analysis.
Supported formats
- .RData (native) files: load()
- .csv files: read.csv()
- .xls/.xlsx files: gdata::read.xls() or xlsx::read.xlsx()
- .sas7bdat files: sas7bdat::read.sas7bdat()
- .dta files: foreign::read.dta()
- and more...
http://cran.r-project.org/doc/manuals/R-data.html
foreign::read.dta()
- foreign: package name (packages add functions)
- read.dta: function name
- Functions are followed by (), in which you specify arguments
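A small sketch of the package::function(arguments) notation. The stats package and its median() function ship with R and are used here only for illustration; they are not from the slides.

```r
# package::function(arguments)
# stats is a package that ships with R; median is one of its functions.
med <- stats::median(c(3, 1, 2))

# Arguments can be named inside the parentheses.
# na.rm = TRUE drops missing values before computing the median.
med.narm <- stats::median(c(3, 1, 2, NA), na.rm = TRUE)

print(med)
print(med.narm)
```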
Create a folder for this group
Open RStudio
Make sure your working directory is correct
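One way to check and set the working directory from the console, as a rough sketch. The folder name r_group is hypothetical, and a temporary location is used so the example runs anywhere; in practice you would point setwd() at your own folder.

```r
# Show the current working directory.
print(getwd())

# Create a folder and move into it.
# "r_group" is a hypothetical name; tempdir() is used only so this runs anywhere.
group.dir <- file.path(tempdir(), "r_group")
dir.create(group.dir, showWarnings = FALSE)
old.dir <- setwd(group.dir)   # setwd() returns the previous directory

# Files given by a relative name are now looked up in this folder.
print(basename(getwd()))
print(list.files())

# Go back to where we started.
setwd(old.dir)
```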
Download files
- Rosner (ASCII, comma-separated and Stata): http://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20bI&product_isbn_issn=9780538733496
- Hernan (Excel and SAS): http://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
.csv
http://www.wondergraphs.com/img/SFO_Landings.csv
For comma-, tab-, or space-separated text
new.dat <- read.csv("file.csv")
- new.dat: name of object to create
- <-: assignment operator
- read.csv: function to read .csv files
- "file.csv": file name here
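A minimal, self-contained sketch of the same call. The file contents are made up and written to a temporary file first, so nothing needs to be downloaded.

```r
# Write a tiny comma-separated file to a temporary location
# (made-up data, for illustration only).
csv.file <- tempfile(fileext = ".csv")
writeLines(c("id,age,sex",
             "1,34,female",
             "2,51,male",
             "3,47,female"),
           csv.file)

# read.csv() treats the first line as column names by default.
new.dat <- read.csv(csv.file)

str(new.dat)    # structure: class and first values of each column
head(new.dat)   # first few rows
```

str() is a quick way to confirm that each column was read with the class you expected.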
Space separated
http://www.biostat.harvard.edu/~fitzmaur/ala2e/tlc.dat
read.table("file.dat")
or
read.table("file.dat", header = TRUE)
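To see what the header argument changes, here is a sketch on a made-up space-separated file (the column names and values are invented, not taken from tlc.dat).

```r
# Made-up space-separated data with a header line.
dat.file <- tempfile(fileext = ".dat")
writeLines(c("id group week0 week1",
             "1 P 30.8 26.9",
             "2 A 26.5 14.8"),
           dat.file)

# Here the header line is read as data and the columns are named V1, V2, ...
no.header <- read.table(dat.file)

# With header = TRUE the first line supplies the column names.
with.header <- read.table(dat.file, header = TRUE)

print(names(no.header))
print(names(with.header))
```

If the file has no header line, leave header at its default so the first subject is not swallowed as column names.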
tab-separated
read.delim("file.tsv")
http://www.brookscole.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&flag=student&product_isbn_issn=9780495384960&disciplinenumber=1038&template=AUS
Excel files
Install the xlsx package
Just click the box to load it
To install/load a package
install.packages("package", dependencies = TRUE)
library(package)
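The install-then-load steps can be wrapped so that installation only happens when needed. This is a sketch: load_or_install is a hypothetical helper name, not part of R, and the tools package (which ships with R) is used only so the example runs without downloading anything.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  # character.only = TRUE lets library() take the name from a variable.
  library(pkg, character.only = TRUE)
}

# "tools" ships with R, so nothing is downloaded here.
load_or_install("tools")

# Loaded packages appear on the search path.
print("package:tools" %in% search())
```

For the Excel example you would call it with "xlsx" instead, which does require an internet connection the first time.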
xlsdat <- read.xlsx("file.xls", 1)
- xlsdat: name of object to create
- <-: assignment operator
- read.xlsx: function to read .xls/.xlsx files
- "file.xls": file name here
- 1: sheet number
SAS native files
library(sas7bdat)
sasdat <- read.sas7bdat("file.sas7bdat")
SAS xport files
library(foreign)
xptdat <- read.xport("file.xpt")
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2009-2010/DEMO_F.xpt
Stata files
library(foreign)
statadat <- read.dta("file.dta")
http://www.biostat.harvard.edu/~fitzmaur/ala2e/headache.dta
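A self-contained sketch of reading a Stata file. It assumes the foreign package is installed (it normally comes with R) and first writes a made-up data set with write.dta() so there is a .dta file to read back.

```r
library(foreign)

# Made-up data, written in Stata format so the example needs no download.
dta.file <- tempfile(fileext = ".dta")
write.dta(data.frame(id = 1:3, score = c(1.5, 2.5, 3.5)), dta.file)

# Read it back as a data frame.
statadat <- read.dta(dta.file)

str(statadat)
```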
Fixed width
fwfdat <- read.fwf("file.txt", widths = c(3, 5, ...))
Use widths = list(c(3, 5, ...), c(5, 7, ...)) when each subject's record spans multiple rows
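A minimal sketch of a fixed-width read. The two-line file and the column names id and value are made up for illustration.

```r
# Made-up fixed-width data: 3 characters of id followed by 5 of value.
fwf.file <- tempfile(fileext = ".txt")
writeLines(c("00112345",
             "00267890"),
           fwf.file)

# widths = c(3, 5): characters 1-3 form the first column, 4-8 the second.
fwfdat <- read.fwf(fwf.file, widths = c(3, 5), col.names = c("id", "value"))

print(fwfdat)
```

Note that the ids are converted to numbers, so the leading zeros are dropped.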
Manipulating data in R
- Objects
- Classes
- Various data objects
Objects
- Just about everything named in R is an object
- An object is a container that
  - knows its class (e.g., "I have numbers inside!").
  - has contents (e.g., the actual numbers).
Examples of objects
- data, which you use for analysis (various classes)
- functions, which perform analysis (function class)
- results, which come out of analysis (various classes)
Classes of data values inside data objects
- Numeric: Continuous variables
- Factor: Categorical variables
- Logical: TRUE/FALSE binary variables
- etc...
Class?
- An object's class tells R how the object should be handled.
- For example, summarizing data should work differently for numbers and categories!
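A short sketch of class-dependent behaviour: the same summary() call gives different kinds of output depending on the class. The variable names and values are made up.

```r
ages   <- c(34, 51, 47, 29)                                      # numeric
groups <- factor(c("treated", "control", "treated", "treated"))  # factor
flags  <- c(TRUE, FALSE, TRUE, TRUE)                             # logical

# Ask each object what class it is.
print(class(ages))
print(class(groups))
print(class(flags))

# summary() adapts to the class.
print(summary(ages))     # minimum, quartiles, mean, maximum
print(summary(groups))   # counts per category
print(summary(flags))    # counts of FALSE and TRUE
```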
Data objects
- Vector (contains single class of data values)
  - Array including Matrix
- List (contains multiple classes of data values)
  - Data frame
Vector
- Smallest building block of data objects
- Single dimension
- Combination of values of same class
- vec1 <- c(2013, 2, 15, -10) # combine
- vec2 <- 1:16 # integers 1 to 16
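Because a vector holds a single class, mixing classes inside c() makes R convert everything to one common class. A sketch, with the object name mixed chosen for illustration:

```r
vec1 <- c(2013, 2, 15, -10)
vec2 <- 1:16

print(class(vec1))    # numeric
print(class(vec2))    # integer
print(length(vec1))   # number of elements

# Mixing numbers and text: everything becomes character.
mixed <- c(2013, "Feb", 15)
print(class(mixed))
print(mixed)
```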
Array
- Vector folded into a multidimensional structure
- 2-dimensional array is a matrix
- vec3 <- 1:16
- dim(vec3) <- c(4, 4) # 4 x 4 structure
- dim(vec3) <- c(2, 2, 4) # 2 x 2 x 4 structure
- arr1 <- array(1:60, dim = c(3, 4, 5))
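The folding can be watched step by step. In this sketch the intermediate copy mat is an extra name introduced only to keep the 4 x 4 version around.

```r
vec3 <- 1:16

# Fold into a 4 x 4 structure; values fill column by column.
dim(vec3) <- c(4, 4)
mat <- vec3               # keep a copy of the matrix version
print(is.matrix(mat))
print(mat)

# The same 16 values refolded into 2 x 2 x 4.
dim(vec3) <- c(2, 2, 4)
print(dim(vec3))

# array() creates the values and the structure in one step.
arr1 <- array(1:60, dim = c(3, 4, 5))
print(dim(arr1))
```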
List
- Combination of any values or objects
- Can contain objects of multiple classes
- e.g., a list of two vectors, a matrix, three arrays
- list1 <- list(first = 1:17, second = matrix(letters, 13, 2))
- list2 <- list(alpha = c(1, 4, 5, 7), beta = c("h", "s", "p", "h"))
Data frame
- Special case of a list
- List of same-length vectors vertically aligned
- df1 <- data.frame(list2)
- list3 <- list(small = letters, large = LETTERS, number = 1:26)
- df2 <- data.frame(list3)
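A sketch showing that a data frame is a list whose elements become columns, and that the vectors must have the same length. The object bad is a made-up counterexample.

```r
list2 <- list(alpha = c(1, 4, 5, 7), beta = c("h", "s", "p", "h"))
df1 <- data.frame(list2)

print(df1)
print(dim(df1))       # rows, columns
print(is.list(df1))   # a data frame is still a list

# Vectors of different lengths cannot be aligned into columns.
# try() catches the error so the script keeps running.
bad <- try(data.frame(list(a = 1:3, b = 1:2)), silent = TRUE)
print(inherits(bad, "try-error"))
```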
Access by indexes
- letters[3] # 1-dimensional object
- arr1[1, 2, 3] # 3-dimensional object
- arr1[1, , 3] # implies 1,(all),3
- df2[ , 3] # implies (all),3 (df1 has only two columns)
- list1[[1]] # list needs [[ ]]
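A runnable sketch of index access. The objects are recreated here so the block stands alone; df2 is used for the column example because it has three columns.

```r
arr1  <- array(1:60, dim = c(3, 4, 5))
df2   <- data.frame(small = letters, large = LETTERS, number = 1:26)
list1 <- list(first = 1:17, second = matrix(letters, 13, 2))

print(letters[3])      # third element of a vector
print(arr1[1, 2, 3])   # one value: row 1, column 2, layer 3
print(arr1[1, , 3])    # row 1, all columns, layer 3
print(df2[, 3])        # all rows of the third column

# [[ ]] returns the element itself; [ ] returns a shorter list.
print(class(list1[[1]]))
print(class(list1[1]))
```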
Access named elements
- list3
- list3$small
- list3[["small"]]
- df2$large
- df2[, "large"]
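A sketch confirming that the different notations reach the same element. The objects are recreated so the block stands alone, and the variable col is introduced only for illustration.

```r
list3 <- list(small = letters, large = LETTERS, number = 1:26)
df2   <- data.frame(list3)

print(names(list3))        # the available names
print(list3$small[1:3])    # first three values of one element

# $name and [["name"]] give the same result.
print(identical(list3$small, list3[["small"]]))
print(identical(df2$large, df2[, "large"]))

# [[ ]] also accepts a name stored in a variable; $ does not.
col <- "number"
print(df2[[col]][1:3])
```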