
A tutorial intended primarily for beginners covering the classic methods of getting data into and out of R and accessing data in data frames in R.

R Text-Based Data I/O
R Data Frame Access and Manipulation

Ian M. Cook

September 29, 2010


Background Information


Data Types

R has several important data types:

• numeric (stores integers and floating-point real numbers)

• character (stores strings of characters, not single characters)

• logical (stores TRUE or FALSE)
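As a quick illustration of these three types, the class() function reports the type of a value. The values below are arbitrary examples, not taken from any dataset.

```r
# class() reports the data type of a value.
class(3.14159)   # "numeric"
class("abc")     # "character"
class(TRUE)      # "logical"
```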


Data Containers

The most basic data storage container in R is a scalar, a single unit of data. (R stores a scalar as a vector of length 1.) A scalar might contain a unit of numeric, character, or logical data.

A 1-dimensional array of scalars in R is a vector.

A 2-dimensional array of scalars in R can be a matrix or a data frame. (The focus here is on data frames. Matrices are often less useful and less accessible so are not covered in this presentation.)

R also has other data containers, including lists, which are important to know about but are often less useful for data analysis purposes.


Data Containers

A vector can be created in R using the function c(). To create several vectors of various lengths containing numerical, character, and logical data, we can enter

v1 <- c(1, 3, 9, 3.14159, -88.1, 0)
v2 <- c("abc","def","ghi")
v3 <- c(TRUE, FALSE, TRUE, TRUE)

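Data types cannot be mixed within a vector. Entering mixed data types into a vector using the c() function coerces every entry to a single common type: if any entry is a character string, all non-character entries are converted into character representations; if only logical and numeric values are mixed, the logical values are converted to numbers (TRUE becomes 1, FALSE becomes 0).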
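A small sketch of what c() does with mixed input. The vector names mixed_chr and mixed_num are hypothetical and chosen only for this illustration.

```r
# Mixing types in c() coerces every entry to one common type.
mixed_chr <- c(1, "abc", TRUE)   # a character entry makes every entry character
mixed_num <- c(1, TRUE, FALSE)   # logical and numeric together become numeric

class(mixed_chr)   # "character"
mixed_chr          # "1" "abc" "TRUE"
class(mixed_num)   # "numeric"
mixed_num          # 1 1 0
```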


Data Frames

A data frame is a rectangular array, with each column representing a variable.

Different columns in a data frame may have different data types. (E.g. a data frame might have character strings in column 1, numerical values in column 2, and logical values in column 3.)

A data frame can be created in R using the function data.frame(), but it is often more useful to input a data frame from an external data file or database.
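For small examples it can still be handy to build a data frame by hand. The sketch below assumes three made-up columns (NAME, SIDD, ACTIVE) with invented values; only the data.frame() call itself comes from the text above.

```r
# Build a small data frame by hand; each named argument becomes one column.
ds <- data.frame(
  NAME   = c("abc", "def", "ghi"),   # character column
  SIDD   = c(1.5, 2.5, 3.5),         # numeric column
  ACTIVE = c(TRUE, FALSE, TRUE),     # logical column
  stringsAsFactors = FALSE           # keep NAME as character in every R version
)

str(ds)   # prints the type of each column
```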


Data Frame Input/Output


Basic CSV Data Input

To read the contents of a CSV file into an R data frame named ds, use the command

ds <- read.csv(file, header, …)

header is TRUE by default, indicating that the first row of the CSV file contains the column names.

file is the name of the file, enclosed in single or double quotes.

Example:

ds <- read.csv("C:/data/file.csv", header=TRUE)
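To try read.csv() without an existing data file, one option is to write a tiny CSV to a temporary file first. In this sketch the temporary path, the column names, and the values are all made up for illustration.

```r
# Write a tiny CSV to a temporary file, then read it back into a data frame.
path <- tempfile(fileext = ".csv")   # hypothetical stand-in for a real file path
writeLines(c("NAME,SIDD", "abc,1.5", "def,2.5"), path)

ds <- read.csv(path, header = TRUE)  # first row supplies the column names
names(ds)   # "NAME" "SIDD"
nrow(ds)    # 2

unlink(path)  # remove the temporary file
```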


Important Tips

When specifying file paths, use forward slashes or double backslashes. (The single backslash is the escape character in R strings.)

Works: ds <- read.csv("C:/data/file.csv")

Works: ds <- read.csv("C:\\data\\file.csv")

Fails: ds <- read.csv("C:\data\file.csv")
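The following sketch shows why the doubled backslash works: inside an R string, "\\" stands for one literal backslash. The variable names are hypothetical, and file.path() is shown only as one possible way to assemble a path with forward slashes.

```r
# Both spellings describe the same Windows location.
p_forward <- "C:/data/file.csv"
p_escaped <- "C:\\data\\file.csv"

nchar(p_forward) == nchar(p_escaped)   # TRUE: each "\\" counts as one character
cat(p_escaped, "\n")                   # prints the path with single backslashes

# file.path() joins its arguments with forward slashes.
file.path("C:", "data", "file.csv")    # "C:/data/file.csv"
```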


Other Delimited Text Files

To input a text data table delimited with characters other than commas, use the command

ds <- read.table(file, header, sep, …)

sep specifies the delimiter:
"," indicates a comma
"\t" indicates the tab character

For example:

ds <- read.table("C:/file.txt", sep="\t")
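A runnable sketch of reading a tab-delimited file, using a temporary file with made-up contents. One point worth noting: unlike read.csv(), read.table() defaults to header = FALSE, so header = TRUE is passed explicitly here.

```r
# Create a small tab-delimited file, then read it with read.table().
path <- tempfile(fileext = ".txt")   # hypothetical stand-in for a real file path
writeLines(c("NAME\tSIDD", "abc\t1.5", "def\t2.5"), path)

ds <- read.table(path, header = TRUE, sep = "\t")
names(ds)   # "NAME" "SIDD"
ds$SIDD     # 1.5 2.5

unlink(path)  # remove the temporary file
```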


Important Tips

The logical values TRUE and FALSE must be all caps.

If a data frame named ds already exists, the command ds <- read.csv(…) or any other command with ds on the left side of the assignment operator <- will overwrite ds if it executes successfully.

Refer to the R Documentation page on read.table(…) for more detailed information and for other options such as ignoring comment headers and using special quotation characters.


CSV Data Output

To write the contents of a data frame named ds to a CSV file, use the command

write.csv(ds, file, …)

For example:

write.csv(ds, "C:/data/file.csv")

To output a file delimited by a character other than the comma, use the command

write.table(ds, file, … , sep)


Important Tips

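The functions write.csv(…) and write.table(…) have many options, including col.names and row.names, which control whether column names and row names (row numbers, by default) are written to the file. Note that write.csv(…) always writes the column names; attempts to change col.names or sep in write.csv(…) are ignored with a warning, so use write.table(…) when those need to change.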

Refer to the R Documentation on write.table(…) for more information.
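As a sketch of both output functions, the example below writes a made-up data frame to temporary files and reads the CSV back. The file paths, column names, and values are assumptions for illustration; row.names = FALSE is used so that row numbers are not written.

```r
# A small data frame with invented contents.
ds <- data.frame(NAME = c("abc", "def"), SIDD = c(1.5, 2.5))

csv_path <- tempfile(fileext = ".csv")   # hypothetical output locations
tsv_path <- tempfile(fileext = ".txt")

write.csv(ds, csv_path, row.names = FALSE)                 # comma-delimited
write.table(ds, tsv_path, sep = "\t", row.names = FALSE)   # tab-delimited

# Read the CSV back to confirm the round trip.
back <- read.csv(csv_path)
names(back)   # "NAME" "SIDD"
```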


Databases

R has simple facilities for querying databases and filling a data frame with the results of your query.

R can query MySQL databases using the R package RMySQL.

R can query Oracle databases using the R package ROracle.

Queries to either database type require the R package DBI.


MySQL Databases

To fill a data frame ds with the results of a SQL query against a MySQL database, use the following template R code:

library(DBI)
library(RMySQL)
# connection settings (placeholders; replace with real values)
db_name <- "database_name"
db_node <- "database_node"
db_user <- "username"
db_pw <- "password"
mysql <- dbDriver("MySQL")
sql_statement <- "select … from …"
# open the connection, run the query, then close the connection
con <- dbConnect(mysql, user=db_user, password=db_pw,
                 dbname=db_name, host=db_node)
ds <- dbGetQuery(con, sql_statement)
dbDisconnect(con)


Oracle Databases

To fill a data frame ds with the results of a SQL query against an Oracle database, use the following template R code:

library(DBI)
library(ROracle)
db_name <- "database_name"
db_user <- "username"
db_pw <- "password"
ora <- dbDriver("Oracle")
sql_statement <- "select … from …"
con <- dbConnect(ora, user=db_user, password=db_pw, dbname=db_name)
ds <- dbGetQuery(con, sql_statement)
dbDisconnect(con)


Data Frame Access and Manipulation


Accessing Columns in a Data Frame

Each column in a data frame represents a variable. Different columns may have different data types (e.g. character strings in column 1, numerical values in column 2, logical values in column 3).

Columns inside a data frame can be accessed in any of three basic methods:

• Dollar sign extraction operator $

• Square brackets extraction operator []

• subset() function


Dollar Sign Extraction Operator

A single column from a data frame can be accessed using the dollar sign operator $ as follows. To return a vector containing the data in the column named SIDD in the data frame named ds, issue the command

ds$SIDD

Do not surround the name of the column in quotes when using the $ operator.


Square Brackets Extraction Operator

A single column from a data frame may also be accessed using the square brackets operator [] as follows. To return a vector containing the column named SIDD in the data frame named ds, issue the command

ds[,"SIDD"]

You must surround the name of the column in double or single quotes when using the [] operator.

The comma before the column name is important, as you will see several slides ahead.


subset() Function

A third way to access a single column in a data frame uses R’s subset() function. Unlike $ and [], subset() returns a data frame rather than a vector. To return a data frame containing only the column named SIDD in the data frame named ds, issue the command

subset(ds, select="SIDD")
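The three access methods can be compared side by side. This sketch uses a made-up data frame; the result names by_dollar, by_brackets, and by_subset are hypothetical. Note that $ and [] return a vector, while subset() keeps the result as a one-column data frame.

```r
# A small data frame with invented contents.
ds <- data.frame(NAME = c("abc", "def", "ghi"), SIDD = c(10, 20, 30))

by_dollar   <- ds$SIDD                      # vector
by_brackets <- ds[, "SIDD"]                 # vector
by_subset   <- subset(ds, select = "SIDD")  # data frame with one column

identical(by_dollar, by_brackets)   # TRUE
is.data.frame(by_subset)            # TRUE
```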


Numerical Indices

R indexes data containers with integers, beginning at 1.

This is unlike many other programming languages (for example C, Java, and Python), in which indices begin at 0.

The square brackets extraction operator also accepts the number of the column. If the third column in the data frame ds is named SIDD, then

ds[,"SIDD"] and ds[,3]

are equivalent commands.
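A short check of this equivalence, assuming a made-up data frame in which SIDD happens to be the third column:

```r
# SIDD is deliberately placed in column 3 of this invented data frame.
ds <- data.frame(NAME = c("abc", "def"), ACTIVE = c(TRUE, FALSE), SIDD = c(10, 20))

identical(ds[, "SIDD"], ds[, 3])   # TRUE
which(names(ds) == "SIDD")         # 3, the column's position
```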


Accessing Rows in a Data Frame

The rows of a data frame are not generally named, but are numbered beginning at 1.

The rows of a data frame can be accessed by either of two methods:

• Square brackets extraction operator []

• subset() function


Square Brackets Extraction Operator

To return the nth row of a data frame ds, issue the command

ds[n,]

The result is a data frame with one row rather than a vector, because the columns of a row may have different data types.

The comma after the row number is important. The square brackets expect a row number before the comma, and a column name or number after the comma.


Square Brackets Extraction Operator

Square brackets can also be used to return multiple rows of a data frame. To return a smaller data frame containing the nth through (n+m)th rows of a data frame ds, issue the command

ds[n:(n+m),]

The above command also demonstrates the colon operator :, which is used to create sequences of integer numbers, in this case beginning with n and ending with n+m.
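A sketch of row extraction with concrete numbers, using a made-up data frame and the arbitrary choices n = 2 and m = 3. The parentheses in n:(n+m) matter, because the colon operator is evaluated before addition.

```r
# An invented data frame with six rows.
ds <- data.frame(ID = 1:6, SIDD = c(10, 20, 30, 40, 50, 60))
n <- 2
m <- 3

one_row   <- ds[n, ]            # a data frame with one row
some_rows <- ds[n:(n + m), ]    # rows 2 through 5

n:(n + m)   # 2 3 4 5
n:n + m     # 5, because this is (n:n) + m
```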


subset() Function

The subset() function is sometimes useful in returning multiple rows of a data frame. It is more complicated to use than the square brackets.

For example, to extract the 2nd, 4th, and 5th rows of a data frame with 5 rows, we could issue the commands:

index <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
subset(ds, subset=index)
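The same logical index also works inside square brackets, and subset() can take a condition on a column directly. The sketch below uses a made-up data frame; the threshold 25 and the result names are arbitrary choices for illustration.

```r
# An invented data frame with five rows.
ds <- data.frame(ID = 1:5, SIDD = c(10, 20, 30, 40, 50))
index <- c(FALSE, TRUE, FALSE, TRUE, TRUE)

by_subset   <- subset(ds, subset = index)   # rows 2, 4, and 5
by_brackets <- ds[index, ]                  # the same rows via []

# subset() can also build the logical index from a condition on a column.
by_condition <- subset(ds, SIDD > 25)       # rows 3, 4, and 5
```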


Square Brackets Extraction Operator

An individual scalar entry within a data frame can be returned by using the square bracket operators, with numbers on both sides of the comma.

To return the scalar value in the mth row and nth column of a data frame ds, issue the command

ds[m,n]

To return the scalar value in the mth row of the data frame ds, in the column named SIDD, issue the command

ds[m,"SIDD"]


Assignment with [] and $

The square brackets and dollar sign can also be used to assign values within a data frame. If the column SIDD in the data frame ds contains numerical data, we can multiply the 5th entry in the SIDD column by two by issuing the command

ds[5,"SIDD"] <- 2 * ds[5,"SIDD"]

We could create a new column (or replace the values within the column) named TWICE_SIDD in the data frame ds, and fill it with values twice those in the column SIDD, by issuing the command

ds$TWICE_SIDD <- 2 * ds$SIDD
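Both assignments can be tried on a small made-up data frame. The starting values 1 through 5 are invented for this sketch.

```r
# An invented data frame with one numeric column.
ds <- data.frame(SIDD = c(1, 2, 3, 4, 5))

ds[5, "SIDD"] <- 2 * ds[5, "SIDD"]   # the 5th entry changes from 5 to 10
ds$TWICE_SIDD <- 2 * ds$SIDD         # new column computed from the updated SIDD

ds$SIDD         # 1 2 3 4 10
ds$TWICE_SIDD   # 2 4 6 8 20
```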


Dimensions

Commands to return the dimensions of a data frame ds are

dim(ds)
nrow(ds)
ncol(ds)

dim(ds) returns a vector of length two containing the number of rows in position 1 and the number of columns in position 2.

The command to return the length of a vector v is

length(v)
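A sketch of these commands on made-up data. Note that length() applied to a data frame returns its number of columns, so nrow() is the one to use for counting rows.

```r
# An invented data frame with 4 rows and 2 columns, and an invented vector.
ds <- data.frame(A = 1:4, B = c("w", "x", "y", "z"))
v  <- c(1, 3, 9)

dim(ds)      # 4 2
nrow(ds)     # 4
ncol(ds)     # 2
length(v)    # 3
length(ds)   # 2, the number of columns
```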


Factors

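By default, versions of R before 4.0.0 store the character vector columns in data frames as factors. (Since R 4.0.0 the default is stringsAsFactors = FALSE, so character columns remain character vectors unless factors are requested.) In R, a factor is an indexed vector.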

To factor a vector, R identifies the unique entries in the vector and makes them the levels of the factor. Each vector entry is then indexed by an integer to one of the factor levels. This saves memory when the entries in a vector are not all unique.

There are several functions to handle factors. Refer to the R Documentation or Help pages about factors.
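The indexing described above can be inspected directly. In this sketch the vector of color names is made up; factor(), levels(), as.integer(), and as.character() are base R functions.

```r
# An invented character vector with repeated entries.
v <- c("red", "blue", "red", "green", "blue", "red")
f <- factor(v)

levels(f)         # "blue" "green" "red" (the unique entries, sorted)
as.integer(f)     # 3 1 3 2 1 3 (each entry's index into the levels)
as.character(f)   # recovers the original strings
```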


Connections and Line-by-Line Text Input/Output


Connections

In some cases, it is preferable to import or export data line-by-line.

Line-by-line data input/output reduces R’s memory usage and is useful when dealing with very large delimited text datasets.

Line-by-line text input/output can be useful for reading and writing log files.

The first step in reading line-by-line is opening a file connection.


Connections for Input

R can open a text file connection conn for input using the command

conn <- file(filename, open="rt")

If the specified file exists and is accessible, then a connection is created and opened for text reading.

Example:

conn <- file("C:/data/in.txt", open="rt")

("rt" indicates “read text”)


Line-by-Line Input

Once a text file input connection is open, we can use one of R’s line-by-line text input functions:

readLines(conn, n)

scan(…)

The scan(…) function is useful for importing delimited data files (e.g. CSV) line by line. The scan(…) function has many arguments. Refer to its lengthy R Documentation page for details.

The readLines(…) function is simpler and is useful for reading unstructured lines of text.


Line-by-Line Input

To read one line of text from a file into the character variable str (a character vector of length 1), we could use the following series of commands

conn <- file("C:/data/in.txt", open="rt")
str <- readLines(conn, n=1)
close(conn)

The close(conn) command closes the connection, leaving the file intact, and leaving str in the R workspace.
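An open connection remembers its position, so repeated calls to readLines() walk through the file. The sketch below uses a temporary file with made-up contents; the loop that counts lines is one possible pattern, not the only one.

```r
# Create a small text file to read from (hypothetical contents).
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), path)

# Successive reads on an open connection return successive lines.
conn   <- file(path, open = "rt")
first  <- readLines(conn, n = 1)
second <- readLines(conn, n = 1)
close(conn)

# Read until readLines() returns a vector of length 0 (end of file).
conn  <- file(path, open = "rt")
count <- 0
while (length(line <- readLines(conn, n = 1)) > 0) {
  count <- count + 1   # a real script would process 'line' here
}
close(conn)
```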


Connections for Output

R can create a text file connection conn for output using the command

conn <- file(filename, open="wt")

If the file does not exist, it is created. If the file already exists, its contents are erased!

Example:

conn <- file("C:/data/out.txt", open="wt")

("wt" indicates “write text”)


Output to a Connection

Once a text file output connection is open, we can write text to the connection by making one or more calls to the R function

write("text to write", file=conn, append=TRUE)

Once finished writing text to the connection, close it using the command

close(conn)
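Putting the steps together, a minimal sketch of writing two lines through a connection and reading them back. The temporary file and the log text are made up for this illustration.

```r
# Open a connection for writing to a temporary file (hypothetical location).
path <- tempfile(fileext = ".txt")
conn <- file(path, open = "wt")

# Each call to write() adds one line while the connection stays open.
write("first log entry", file = conn)
write("second log entry", file = conn)
close(conn)

readLines(path)   # "first log entry" "second log entry"
```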


Output to a File

R can also write directly to a file without creating a connection. In this example, we retain the contents of an existing text file and append new text.

To write the contents of the character string str to a file, issue the command

write(str, file=filename, append=TRUE)

Example:

str <- "some text to output \nline 2"
write(str, file="C:/data/out.txt", append=TRUE)


Output to a File

If the specified file does not exist, the write() command will create it.

Be sure to use the append=TRUE option when appending to an existing text file, or the file’s contents will be cleared!

There is no need to use the close() command after writing to a file without using a connection, because no persistent connection has been opened.

Use the newline character \n to create line breaks in text output.
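The effect of append=TRUE, and of leaving it out, can be seen in a short sketch. The temporary file and the text written to it are made up for this illustration.

```r
path <- tempfile(fileext = ".txt")   # hypothetical output location

write("line 1", file = path)                          # creates the file
write("line 2\nline 3", file = path, append = TRUE)   # \n starts a new line
appended <- readLines(path)    # "line 1" "line 2" "line 3"

write("replaced", file = path)                        # no append: contents are overwritten
overwritten <- readLines(path) # "replaced"
```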


Gzip Connections

R provides facilities for line-by-line reading and writing of files compressed by the gzip utility.

To create a connection to a gzip file for reading, issue the command

conn <- gzfile(filename, open="rt")

To create a connection to a gzip file for writing, issue the command

conn <- gzfile(filename, open="wt")

The readLines(), write(), and close() functions can be used in the same way as with text file connections.
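A minimal round trip through a gzip connection, assuming a temporary file and made-up text. The calls mirror the plain text file examples, with gzfile() in place of file().

```r
path <- tempfile(fileext = ".gz")   # hypothetical compressed file location

# Write two lines through a gzip connection.
conn <- gzfile(path, open = "wt")
write("compressed line 1", file = conn)
write("compressed line 2", file = conn)
close(conn)

# Read them back; decompression happens transparently.
conn  <- gzfile(path, open = "rt")
lines <- readLines(conn)
close(conn)

lines   # "compressed line 1" "compressed line 2"
```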