R Text-Based Data I/O and Data Frame Access and Manipulation
A tutorial intended primarily for beginners, covering the classic methods of getting data into and out of R and accessing data in data frames in R.
R Text-Based Data I/O
R Data Frame Access and Manipulation
Ian M. Cook
September 29, 2010
R Data I/O, Access, and Manipulation, September 29, 2010
Background Information
Data Types
R has several important data types:
• numeric (stores integers and floating-point real numbers)
• character (stores strings of characters, not single characters)
• logical (stores TRUE or FALSE)
Data Containers
The most basic unit of data in R is a scalar, a single (1x1) value. A scalar might contain a unit of numeric, character, or logical data. (R stores a scalar as a vector of length 1.)
A 1-dimensional array of scalars in R is a vector.
A 2-dimensional array of scalars in R can be a matrix or a data frame. (The focus here is on data frames. Matrices are often less useful and less accessible so are not covered in this presentation.)
R also has other data containers, including lists, which are important to know about but are often less useful for data analysis purposes.
Data Containers
A vector can be created in R using the function c(). To create several vectors of various lengths containing numerical, character, and logical data, we can enter
v1 <- c(1, 3, 9, 3.14159, -88.1, 0)
v2 <- c("abc","def","ghi")
v3 <- c(TRUE, FALSE, TRUE, TRUE)
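Data types cannot be mixed within a vector. If any entry passed to c() is a character string, all other entries are converted to character representations. If only logical and numeric values are mixed, the logical values are converted to numbers (TRUE becomes 1, FALSE becomes 0).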
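A quick way to see this conversion is to check the result with class(). The sketch below uses made-up values; the names mixed_chr and mixed_num are illustrative only.

```r
# Mixing numbers and logicals with a character string:
# every entry becomes a character string.
mixed_chr <- c(1, "abc", TRUE)
class(mixed_chr)   # "character"

# Mixing only logicals and numbers:
# the logicals become numbers.
mixed_num <- c(TRUE, 2.5, FALSE)
class(mixed_num)   # "numeric"
```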
Data Frames
A data frame is a rectangular array, with each column representing a variable.
Different columns in a data frame may have different data types. (E.g. a data frame might have character strings in column 1, numerical values in column 2, and logical values in column 3.)
A data frame can be created in R using the function data.frame(), but it is often more useful to input a data frame from an external data file or database.
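For small experiments, building a data frame by hand can still be handy. A minimal sketch, with hypothetical column names and values:

```r
# Build a small data frame from three vectors of equal length.
# Each vector becomes one column; columns may differ in data type.
ds <- data.frame(
  NAME  = c("a", "b", "c"),
  SIDD  = c(10.5, 20.1, 30.7),
  VALID = c(TRUE, FALSE, TRUE)
)
str(ds)  # prints one line per column, including its data type
```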
Data Frame Input/Output
Basic CSV Data Input
To read the contents of a CSV file into an R data frame named ds, use the command
ds <- read.csv(file, header, …)
header is TRUE by default, indicating that the first row of the CSV file contains the column names.
file is the name of the file, enclosed in single or double quotes.
Example:
ds <- read.csv("C:/data/file.csv", header=TRUE)
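To try read.csv() without a file at C:/data, one option is to write a tiny CSV to a temporary file first. This is only a sketch; the file contents and the name csv_path are invented for illustration.

```r
# Create a small CSV in a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("NAME,SIDD", "a,10.5", "b,20.1"), csv_path)

# header=TRUE (the default) treats the first row as column names.
ds <- read.csv(csv_path, header = TRUE)
names(ds)  # "NAME" "SIDD"
nrow(ds)   # 2
```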
Important Tips
When specifying file paths, use forward slashes or double backslashes. (A single backslash is the escape character in R strings, so it cannot stand on its own in a path.)
Works: ds <- read.csv("C:/data/file.csv")
Works: ds <- read.csv("C:\\data\\file.csv")
Fails: ds <- read.csv("C:\data\file.csv")
Other Delimited Text Files
To input a text data table delimited with characters other than commas, use the command
ds <- read.table(file, header, sep, …)
sep specifies the delimiter:
• "," indicates a comma
• "\t" indicates the tab character
For example:
ds <- read.table("C:/file.txt", sep="\t")
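One detail worth checking when switching from read.csv() to read.table() is the header argument, which is FALSE in the signature of read.table(). The sketch below assumes a small tab-delimited file with a header row; the contents and the name tsv_path are invented.

```r
# Create a small tab-delimited file in a temporary location.
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("NAME\tSIDD", "a\t10.5", "b\t20.1"), tsv_path)

# Request the header explicitly, since this file has one.
ds <- read.table(tsv_path, header = TRUE, sep = "\t")
names(ds)  # "NAME" "SIDD"
```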
Important Tips
The logical values TRUE and FALSE must be all caps.
If a data frame named ds already exists, the command ds <- read.csv(…) or any other command using ds on the left side of the assignment operator <- will overwrite ds if it executes successfully.
Refer to the R Documentation page on read.table(…) for more detailed information and for other options such as ignoring comment headers and using special quotation characters.
CSV Data Output
To write the contents of a data frame named ds to a CSV file, use the command
write.csv(ds, file, …)
For example:
write.csv(ds, "C:/data/file.csv")
To output a file delimited by a character other than the comma, use the command
write.table(ds, file, … , sep)
Important Tips
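The functions write.csv(…) and write.table(…) have many options. Both accept row.names, which controls whether row names (by default, the row numbers) are written. write.table(…) also accepts col.names to control the header row; write.csv(…) ignores attempts to change col.names or sep and issues a warning.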
Refer to the R Documentation on write.table(…) for more information.
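As a rough illustration of these options, the sketch below writes the same hypothetical data frame once as CSV and once as a tab-delimited file, in both cases without row numbers. The temporary file names are placeholders.

```r
ds <- data.frame(NAME = c("a", "b"), SIDD = c(10.5, 20.1))

csv_out <- tempfile(fileext = ".csv")
tsv_out <- tempfile(fileext = ".txt")

# row.names=FALSE omits the leading column of row numbers.
write.csv(ds, csv_out, row.names = FALSE)

# write.table() lets the delimiter be chosen;
# quote=FALSE leaves character strings unquoted.
write.table(ds, tsv_out, sep = "\t", row.names = FALSE, quote = FALSE)

readLines(csv_out)  # inspect the CSV text
readLines(tsv_out)  # inspect the tab-delimited text
```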
Databases
R has simple facilities for querying databases and filling a data frame with the results of your query.
R can query MySQL databases using the R package RMySQL.
R can query Oracle databases using the R package ROracle.
Queries to either database type require the R package DBI.
MySQL Databases
To fill a data frame ds with the results of a SQL query against a MySQL database, use the following template R code:
library(DBI)
library(RMySQL)
db_name <- "database_name"
db_node <- "database_node"
db_user <- "username"
db_pw <- "password"
mysql <- dbDriver("MySQL")
sql_statement <- "select … from …"
con <- dbConnect(mysql, user=db_user, password=db_pw,
                 dbname=db_name, host=db_node)
ds <- dbGetQuery(con, sql_statement)
dbDisconnect(con)  # DBI generic; replaces the older mysqlCloseConnection(con)
Oracle Databases
To fill a data frame ds with the results of a SQL query against an Oracle database, use the following template R code:
library(DBI)
library(ROracle)
db_name <- "database_name"
db_user <- "username"
db_pw <- "password"
ora <- dbDriver("Oracle")
sql_statement <- "select … from …"
con <- dbConnect(ora, user=db_user, password=db_pw, dbname=db_name)
ds <- dbGetQuery(con, sql_statement)
dbDisconnect(con)
Data Frame Access and Manipulation
Accessing Columns in a Data Frame
Each column in a data frame represents a variable. Different columns may have different data types (e.g. character strings in column 1, numerical values in column 2, logical values in column 3).
Columns inside a data frame can be accessed by any of three basic methods:
• Dollar sign extraction operator $
• Square brackets extraction operator []
• subset() function
Dollar Sign Extraction Operator
A single column from a data frame can be accessed using the dollar sign operator $ as follows. To return a vector containing the data in the column named SIDD in the data frame named ds, issue the command
ds$SIDD
Do not surround the name of the column in quotes when using the $ operator.
Square Brackets Extraction Operator
A single column from a data frame may also be accessed using the square brackets operator [] as follows. To return a vector containing the column named SIDD in the data frame named ds, issue the command
ds[,"SIDD"]
You must surround the name of the column in double or single quotes when using the [] operator.
The comma before the column name is important, as you will see several slides ahead.
subset() Function
A third way to access a single column in a data frame uses R’s subset() function. Unlike $ and [], it returns a one-column data frame rather than a vector. To return a data frame containing only the column named SIDD from the data frame named ds, issue the command
subset(ds, select="SIDD")
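The three access methods can be compared side by side. A minimal sketch on a hypothetical data frame; note that $ and [] hand back a vector, while subset() hands back a one-column data frame.

```r
ds <- data.frame(NAME = c("a", "b", "c"), SIDD = c(10.5, 20.1, 30.7))

by_dollar   <- ds$SIDD                      # numeric vector
by_brackets <- ds[, "SIDD"]                 # numeric vector
by_subset   <- subset(ds, select = "SIDD")  # data frame with one column

identical(by_dollar, by_brackets)  # TRUE
class(by_subset)                   # "data.frame"
```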
Numerical Indices
R indexes data containers with integers, beginning at 1.
This is unlike many programming languages (C, Java, and Python, for example), in which indices begin at 0.
The square brackets extraction operator also accepts the number of the column. If the third column in the data frame ds is named SIDD, then
ds[,"SIDD"] and ds[,3]
are equivalent commands.
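When the position of a column is not known in advance, it can be looked up from the column names. A small sketch with a hypothetical three-column data frame:

```r
ds <- data.frame(ID = 1:3, NAME = c("a", "b", "c"), SIDD = c(10.5, 20.1, 30.7))

# Find the position of the column named SIDD, then index by that number.
pos <- which(names(ds) == "SIDD")
pos                                  # 3
identical(ds[, "SIDD"], ds[, pos])   # TRUE
```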
Accessing Rows in a Data Frame
The rows of a data frame are not generally named, but are numbered beginning at 1.
The rows of a data frame can be accessed by either of two methods:
• Square brackets extraction operator []
• subset() function
Square Brackets Extraction Operator
To return a one-row data frame containing the nth row of a data frame ds, issue the command
ds[n,]
The comma after the row number is important. The square brackets expect a row number before the comma, and a column name or number after the comma.
Square Brackets Extraction Operator
Square brackets can also be used to return multiple rows of a data frame. To return a smaller data frame containing the nth through (n+m)th rows of a data frame ds, issue the command
ds[n:(n+m),]
The above command also demonstrates the colon operator :, which is used to create sequences of integer numbers, in this case beginning with n and ending with n+m.
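Row extraction can be tried on a small, made-up data frame. In this sketch n is 2 and m is 2, so the range covers rows 2 through 4.

```r
ds <- data.frame(NAME = c("a", "b", "c", "d", "e"),
                 SIDD = c(10, 20, 30, 40, 50))

one_row   <- ds[2, ]     # a data frame with 1 row
some_rows <- ds[2:4, ]   # a data frame with rows 2, 3, and 4

nrow(one_row)    # 1
some_rows$SIDD   # 20 30 40
2:4              # the colon operator yields 2 3 4
```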
subset() Function
The subset() function is sometimes useful in returning multiple rows of a data frame. It is more complicated to use than the square brackets.
For example, to extract the 2nd, 4th, and 5th rows of a data frame with 5 rows, we could issue the commands:
index <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
subset(ds, subset=index)
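In practice the logical index is usually computed from the data rather than typed by hand. A sketch, assuming a hypothetical numeric column SIDD and an arbitrary threshold of 25:

```r
ds <- data.frame(NAME = c("a", "b", "c", "d", "e"),
                 SIDD = c(10, 20, 30, 40, 50))

# The comparison produces one TRUE/FALSE value per row.
index <- ds$SIDD > 25
index                        # FALSE FALSE TRUE TRUE TRUE

big <- subset(ds, subset = index)

# Equivalent form: subset() looks up the column name inside ds itself.
big2 <- subset(ds, SIDD > 25)
```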
Square Brackets Extraction Operator
An individual scalar entry within a data frame can be returned by using the square bracket operators, with numbers on both sides of the comma.
To return the scalar value in the mth row and nth column of a data frame ds, issue the command
ds[m,n]
To return the scalar value in the mth row of the data frame ds, in the column named SIDD, issue the command
ds[m,"SIDD"]
Assignment with [] and $
The square brackets and dollar sign can also be used to assign values within a data frame. If the column SIDD in the data frame ds contains numerical data, we can multiply the 5th entry in the SIDD column by two by issuing the command
ds[5,"SIDD"] <- 2 * ds[5,"SIDD"]
We could create a new column (or replace the values within the column) named TWICE_SIDD in the data frame ds, and fill it with values twice those in the column SIDD, by issuing the command
ds$TWICE_SIDD <- 2 * ds$SIDD
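Both assignments can be run in sequence on a small, hypothetical data frame. Note that the new column is computed after the 5th entry has already been doubled.

```r
ds <- data.frame(NAME = c("a", "b", "c", "d", "e"),
                 SIDD = c(10, 20, 30, 40, 50))

ds[5, "SIDD"] <- 2 * ds[5, "SIDD"]   # 50 becomes 100
ds$TWICE_SIDD <- 2 * ds$SIDD         # new column based on the updated SIDD

ds[5, ]   # row 5 now has SIDD 100 and TWICE_SIDD 200
```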
Dimensions
Commands to return the dimensions of a data frame ds are
dim(ds)
nrow(ds)
ncol(ds)
dim(ds) returns a vector of length two containing the number of rows in position 1 and the number of columns in position 2.
The command to return the length of a vector v is
length(v)
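A brief sketch of these commands on a hypothetical data frame with 3 rows and 2 columns:

```r
ds <- data.frame(NAME = c("a", "b", "c"), SIDD = c(10.5, 20.1, 30.7))

dim(ds)           # 3 2  (rows first, then columns)
nrow(ds)          # 3
ncol(ds)          # 2
length(ds$SIDD)   # 3, the length of one column vector
```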
Factors
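In R versions before 4.0.0, character columns in data frames were stored as factors by default. Since R 4.0.0 they remain character vectors unless stringsAsFactors=TRUE is passed to data.frame() or read.csv(). In R, a factor is an indexed vector.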
To factor a vector, R identifies the unique entries in the vector and makes them the levels of the factor. Each vector entry is then indexed by an integer to one of the factor levels. This saves memory when the entries in a vector are not all unique.
There are several functions to handle factors. Refer to the R Documentation or Help pages about factors.
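The indexing described above can be inspected directly. A minimal sketch using an invented character vector:

```r
v <- c("red", "blue", "red", "green", "blue")
f <- factor(v)

levels(f)         # "blue" "green" "red"  (the unique values, sorted)
as.integer(f)     # 3 1 3 2 1  (position of each entry within levels(f))
as.character(f)   # recovers the original strings
```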
Connections and Line-by-Line Text Input/Output
Connections
In some cases, it is preferable to import or export data line-by-line.
Line-by-line data input/output reduces R’s memory usage and is useful when dealing with very large delimited text datasets.
Line-by-line text input/output can be useful for reading and writing log files.
The first step in reading line-by-line is opening a file connection.
Connections for Input
R can open a text file connection conn for input using the command
conn <- file(filename, open="rt")
If the specified file exists and is accessible, then a connection is created and opened for text reading.
Example:
conn <- file("C:/data/in.txt", open="rt")
("rt" indicates “read text”)
Line-by-Line Input
Once a text file input connection is open, we can use one of R’s line-by-line text input functions:
• readLines(conn, n)
• scan(…)
The scan(…) function is useful for importing delimited data files (e.g. CSV) line by line. The scan(…) function has many arguments. Refer to its lengthy R Documentation page for details.
The readLines(…) function is simpler and is useful for reading unstructured lines of text.
Line-by-Line Input
To read one line of text from a file into the character variable str (a character vector of length 1), we could use the following series of commands
conn <- file("C:/data/in.txt", open="rt")
str <- readLines(conn, n=1)
close(conn)
The close(conn) command closes the connection, leaving the file intact, and leaving str in the R workspace.
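Because an open connection remembers its position, repeated calls to readLines(conn, n=1) walk through the file one line at a time. The sketch below counts the lines of a small temporary file; the file contents and the names in_path and line_count are illustrative assumptions.

```r
# Create a three-line text file in a temporary location.
in_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), in_path)

conn <- file(in_path, open = "rt")
line_count <- 0
repeat {
  line <- readLines(conn, n = 1)
  if (length(line) == 0) break   # readLines() returns character(0) at end of file
  line_count <- line_count + 1   # a real script would process `line` here
}
close(conn)
line_count   # 3
```

Only one line is held in memory at a time, which is the point of reading through a connection.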
Connections for Output
R can create a text file connection conn for output using the command
conn <- file(filename, open="wt")
If the file does not exist, it is created. If the file already exists, its contents are erased!
Example:
conn <- file("C:/data/out.txt", open="wt")
("wt" indicates “write text”)
Output to a Connection
Once a text file output connection is open, we can write text to the connection by making one or more calls to the R function
write("text to write", file=conn, append=TRUE)
Once finished writing text to the connection, close it using the command
close(conn)
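Putting the output steps together, a minimal sketch that writes two lines through one connection and then reads the file back to check the result. The temporary path out_path is a placeholder.

```r
out_path <- tempfile(fileext = ".txt")

conn <- file(out_path, open = "wt")
write("first entry", file = conn, append = TRUE)
write("second entry", file = conn, append = TRUE)  # continues after the first line
close(conn)

readLines(out_path)   # "first entry" "second entry"
```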
Output to a File
R can also write directly to a file without creating a connection. In this example, we retain the contents of an existing text file and append new text.
To write the contents of the character string str to a file, issue the command
write(str, file=filename, append=TRUE)
Example:
str <- "some text to output \nline 2"
write(str, file="C:/data/out.txt", append=TRUE)
Output to a File
If the specified file does not exist, the write() command will create it.
Be sure to use the append=TRUE option when appending to an existing text file, or the file’s contents will be cleared!
There is no need to use the close() command after writing to a file without using a connection, because no persistent connection has been opened.
Use the newline character \n to create line breaks in text output.
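The difference between appending and overwriting can be checked on a temporary file. A sketch under the assumption that out_path points to a file that does not yet exist:

```r
out_path <- tempfile(fileext = ".txt")

write("line 1", file = out_path)                          # creates the file
write("line 2\nline 3", file = out_path, append = TRUE)   # adds two more lines
after_append <- readLines(out_path)       # "line 1" "line 2" "line 3"

write("fresh start", file = out_path)     # append defaults to FALSE: contents replaced
after_overwrite <- readLines(out_path)    # "fresh start"
```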
Gzip Connections
R provides facilities for line-by-line reading and writing of files compressed by the gzip utility.
To create a connection to a gzip file for reading, issue the command
conn <- gzfile(filename, open="rt")
To create a connection to a gzip file for writing, issue the command
conn <- gzfile(filename, open="wt")
The readLines(), write(), and close() functions can be used in the same way as with text file connections.
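A round trip through a gzip connection might look like the sketch below, which writes two lines to a compressed temporary file and reads them back. The path gz_path and the text are invented for illustration.

```r
gz_path <- tempfile(fileext = ".gz")

# Write two lines through a gzip connection.
conn <- gzfile(gz_path, open = "wt")
write("compressed line 1", file = conn)
write("compressed line 2", file = conn)
close(conn)

# Read them back; decompression happens transparently.
conn <- gzfile(gz_path, open = "rt")
lines_back <- readLines(conn)
close(conn)
lines_back   # "compressed line 1" "compressed line 2"
```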