R Text-Based Data I/O and Data Frame Access and Manipulation

Post on 01-Jul-2015

DESCRIPTION

A tutorial intended primarily for beginners covering the classic methods of getting data into and out of R and accessing data in data frames in R.

TRANSCRIPT

R Text-Based Data I/O

R Data Frame Access and Manipulation

Ian M. Cook

September 29, 2010

R Data I/O, Access, and Manipulation2 September 29, 2010

Background Information

Data Types

R has several important data types:

• numeric (stores integers and floating point real numbers)

• character (stores strings of characters, not single characters)

• logical (stores TRUE or FALSE)

Data Containers

The most basic data storage container in R is a scalar, a 1x1 unit of data. A scalar might contain a unit of numeric, character, or logical data. (Strictly speaking, R represents a scalar as a vector of length one.)

A 1-dimensional array of scalars in R is a vector.

A 2-dimensional array of scalars in R can be a matrix or a data frame. (The focus here is on data frames. Matrices are often less useful and less accessible so are not covered in this presentation.)

R also has other data containers, including lists, which are important to know about but are often less useful for data analysis purposes.

Data Containers

A vector can be created in R using the function c(). To create several vectors of various lengths containing numerical, character, and logical data, we can enter

v1 <- c(1, 3, 9, 3.14159, -88.1, 0)
v2 <- c("abc","def","ghi")
v3 <- c(TRUE, FALSE, TRUE, TRUE)

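Data types cannot be mixed within a vector. Entering mixed data types into a vector using the c() function converts all entries to a single common type. If any entry is a character string, all non-character entries are converted into character representations. If only logical and numeric values are mixed, the logical values are converted to numbers (TRUE becomes 1, FALSE becomes 0).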
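As a small illustration of how c() handles mixed input, the sketch below builds two vectors and inspects the type R chose for each. The variable names are made up for this example.

```r
# Mixing types in c() coerces every entry to one common type.
mixed_chr <- c(1, "abc", TRUE)   # a character string is present
mixed_num <- c(1, TRUE, FALSE)   # only numeric and logical values

class(mixed_chr)   # "character"
mixed_chr          # "1" "abc" "TRUE"

class(mixed_num)   # "numeric"
mixed_num          # 1 1 0
```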

Data Frames

A data frame is a rectangular array, with each column representing a variable.

Different columns in a data frame may have different data types. (E.g. a data frame might have character strings in column 1, numerical values in column 2, and logical values in column 3.)

A data frame can be created in R using the function data.frame(), but it is often more useful to input a data frame from an external data file or database.
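For reference, a minimal sketch of building a small data frame directly with data.frame(). The column names and values are hypothetical and chosen only to show one column of each data type.

```r
# Hypothetical example data; the column names are illustrative only.
ds <- data.frame(
  NAME   = c("abc", "def", "ghi"),   # character column
  SIDD   = c(1.5, 3.0, 9.25),        # numeric column
  ACTIVE = c(TRUE, FALSE, TRUE),     # logical column
  stringsAsFactors = FALSE           # keep NAME as character strings
)

str(ds)   # prints the number of rows and the type of each column
```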

Data Frame Input/Output

Basic CSV Data Input

To read the contents of a CSV file into an R data frame named ds, use the command

ds <- read.csv(file, header, …)

header is TRUE by default, indicating that the first row of the CSV file contains the column names.

file is the name of the file, enclosed in single or double quotes.

Example:

ds <- read.csv("C:/data/file.csv", header=TRUE)
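To try read.csv() without needing a file at C:/data, the sketch below first writes a tiny CSV to a temporary file and then reads it back. The file contents and the name csv_path are assumptions made for this example.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("NAME,SIDD", "abc,1.5", "def,3"), csv_path)

# header=TRUE: the first row supplies the column names.
ds <- read.csv(csv_path, header = TRUE)

names(ds)   # "NAME" "SIDD"
nrow(ds)    # 2

unlink(csv_path)   # remove the temporary file
```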

Important Tips

When specifying file paths, use forward slashes or double backslashes. (The single backslash is the escape character in R strings.)

Works: ds <- read.csv("C:/data/file.csv")

Works: ds <- read.csv("C:\\data\\file.csv")

Fails: ds <- read.csv("C:\data\file.csv")
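The sketch below shows that the two working spellings describe paths of the same length, and that base R's file.path() can assemble a path with forward slashes. No file is opened here, so the path does not need to exist.

```r
p1 <- "C:/data/file.csv"
p2 <- "C:\\data\\file.csv"   # each \\ is a single backslash in the actual string
p3 <- file.path("C:", "data", "file.csv")   # joins the parts with forward slashes

nchar(p1)           # 16
nchar(p2)           # 16, the doubled backslashes count once each
identical(p1, p3)   # TRUE
```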

Other Delimited Text Files

To input a text data table delimited with characters other than commas, use the command

ds <- read.table(file, header, sep, …)

sep specifies the delimiter:

"," indicates a comma

"\t" indicates the tab character

For example:

ds <- read.table("C:/file.txt", sep="\t")
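A self-contained sketch of reading a tab-delimited file follows. The temporary file and its contents are assumptions for illustration. Note that header is passed explicitly here.

```r
# Create a small tab-delimited file in a temporary location.
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("NAME\tSIDD", "abc\t1.5", "def\t3"), tsv_path)

# Unlike read.csv(), read.table() does not assume a header row,
# so request it explicitly when the file has one.
ds <- read.table(tsv_path, header = TRUE, sep = "\t")

names(ds)   # "NAME" "SIDD"

unlink(tsv_path)   # remove the temporary file
```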

Important Tips

The logical values TRUE and FALSE must be all caps.

If a data frame named ds already exists, the command ds <- read.csv(…) or any other command using ds on the left side of the assignment operator <- will overwrite ds if it executes successfully.

Refer to the R Documentation page on read.table(…) for more detailed information and for other options such as ignoring comment headers and using special quotation characters.

CSV Data Output

To write the contents of a data frame named ds to a CSV file, use the command

write.csv(ds, file, …)

For example:

write.csv(ds, "C:/data/file.csv")

To output a file delimited by a character other than the comma, use the command

write.table(ds, file, … , sep)

Important Tips

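The function write.table(…) has many options, including col.names and row.names, which control whether column names and row names (row numbers by default) are written to the file. write.csv(…) accepts row.names as well, but ignores attempts to change col.names and sep, with a warning.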

Refer to the R Documentation on write.table(…) for more information.
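The following sketch writes the same small data frame as a CSV file and as a tab-delimited file, then reads the raw lines back to show what was written. The data and the temporary file names are assumptions for this example.

```r
ds <- data.frame(NAME = c("abc", "def"), SIDD = c(1.5, 3))

# CSV output without the row-number column.
csv_path <- tempfile(fileext = ".csv")
write.csv(ds, csv_path, row.names = FALSE)
csv_lines <- readLines(csv_path)

# Tab-delimited output, without row numbers and without quoting strings.
tsv_path <- tempfile(fileext = ".txt")
write.table(ds, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)
tsv_lines <- readLines(tsv_path)

csv_lines   # header line followed by one line per row
tsv_lines

unlink(c(csv_path, tsv_path))   # remove the temporary files
```

By default both functions quote character strings and write row names; row.names=FALSE and quote=FALSE turn those behaviors off.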

Databases

R has simple facilities for querying databases and filling a data frame with the results of your query.

R can query MySQL databases using the R package RMySQL.

R can query Oracle databases using the R package ROracle.

Queries to either database type require the R package DBI.

MySQL Databases

To fill a data frame ds with the results of a SQL query against a MySQL database, use the following template R code:

library(DBI)
library(RMySQL)
db_name <- "database_name"
db_node <- "database_node"
db_user <- "username"
db_pw <- "password"
mysql <- dbDriver("MySQL")
sql_statement <- "select … from …"
con <- dbConnect(mysql, user=db_user, password=db_pw,
                 dbname=db_name, host=db_node)
ds <- dbGetQuery(con, sql_statement)
dbDisconnect(con)  # close the connection when finished

Oracle Databases

To fill a data frame ds with the results of a SQL query against an Oracle database, use the following template R code:

library(DBI)
library(ROracle)
db_name <- "database_name"
db_user <- "username"
db_pw <- "password"
ora <- dbDriver("Oracle")
sql_statement <- "select … from …"
con <- dbConnect(ora, user=db_user, password=db_pw, dbname=db_name)
ds <- dbGetQuery(con, sql_statement)
dbDisconnect(con)

Data Frame Access and Manipulation

Accessing Columns in a Data Frame

Each column in a data frame represents a variable. Different columns may have different data types (e.g. character strings in column 1, numerical values in column 2, logical values in column 3).

Columns inside a data frame can be accessed by any of three basic methods:

• Dollar sign extraction operator $

• Square brackets extraction operator []

• subset() function

Dollar Sign Extraction Operator

A single column from a data frame can be accessed using the dollar sign operator $ as follows. To return a vector containing the data in the column named SIDD in the data frame named ds, issue the command

ds$SIDD

Do not surround the name of the column in quotes when using the $ operator.

Square Brackets Extraction Operator

A single column from a data frame may also be accessed using the square brackets operator [] as follows. To return a vector containing the column named SIDD in the data frame named ds, issue the command

ds[,"SIDD"]

You must surround the name of the column in double or single quotes when using the [] operator.

The comma before the column name is important, as you will see several slides ahead.
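A short sketch comparing the $ and [] operators on a made-up data frame. The values in ds are assumptions for illustration; only the column name SIDD comes from the slides.

```r
ds <- data.frame(NAME = c("abc", "def", "ghi"), SIDD = c(10, 20, 30))

by_dollar  <- ds$SIDD        # no quotes around the column name
by_bracket <- ds[, "SIDD"]   # quotes required, comma before the name

identical(by_dollar, by_bracket)   # TRUE
is.vector(by_bracket)              # TRUE
```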

subset() Function

A third way to access a single column in a data frame utilizes R’s subset() function. Unlike $ and [], subset() returns a data frame rather than a vector. To return a one-column data frame containing the column named SIDD in the data frame named ds, issue the command

subset(ds, select="SIDD")
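The sketch below checks what subset() returns when a single column is selected, using a made-up data frame. The object name col_df is an assumption for this example.

```r
ds <- data.frame(NAME = c("abc", "def", "ghi"), SIDD = c(10, 20, 30))

col_df <- subset(ds, select = "SIDD")

class(col_df)   # "data.frame"
dim(col_df)     # 3 1

# To get the underlying vector, extract the column from the result.
col_df$SIDD
```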

Numerical Indices

R indexes data containers with integers, beginning at 1.

This is unlike many programming languages (C, Java, and Python, for example), in which indices begin at 0.

The square brackets extraction operator also accepts the number of the column. If the third column in the data frame ds is named SIDD, then

ds[,"SIDD"] and ds[,3]

are equivalent commands.
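A minimal check of this equivalence, assuming a made-up data frame whose third column is named SIDD:

```r
ds <- data.frame(ID = 1:3, NAME = c("abc", "def", "ghi"), SIDD = c(10, 20, 30))

col_number <- which(names(ds) == "SIDD")   # position of the SIDD column
col_number                                 # 3

identical(ds[, "SIDD"], ds[, 3])   # TRUE

first_id <- ds[, 1][1]   # the first element sits at index 1, not 0
```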

Accessing Rows in a Data Frame

The rows of a data frame are not generally named, but are numbered beginning at 1.

The rows of a data frame can be accessed by either of two methods:

• Square brackets extraction operator []

• subset() function

Square Brackets Extraction Operator

To return the nth row of a data frame ds (as a one-row data frame, since the columns may have different data types), issue the command

ds[n,]

The comma after the row number is important. The square brackets expect a row number before the comma, and a column name or number after the comma.

Square Brackets Extraction Operator

Square brackets can also be used to return multiple rows of a data frame. To return a smaller data frame containing rows n through n+m of a data frame ds, issue the command

ds[n:(n+m),]

The above command also demonstrates the colon operator :, which is used to create sequences of integer numbers, in this case beginning with n and ending with n+m.
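The sketch below applies both row-access forms to a made-up six-row data frame, with n and m given concrete values chosen for illustration.

```r
ds <- data.frame(ID = 1:6, SIDD = c(10, 20, 30, 40, 50, 60))
n <- 2
m <- 3

one_row <- ds[n, ]         # a one-row data frame
several <- ds[n:(n + m), ] # rows n, n+1, ..., n+m

nrow(one_row)   # 1
several$ID      # 2 3 4 5

# The parentheses matter: the colon binds tighter than +,
# so n:n+m means (n:n) + m.
n:n + m         # 5
```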

subset() Function

The subset() function is sometimes useful in returning multiple rows of a data frame. It is more complicated to use than the square brackets.

For example, to extract the 2nd, 4th, and 5th rows of a data frame with 5 rows, we could issue the commands:

index <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
subset(ds, subset=index)
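A runnable version of this example on a made-up five-row data frame follows. It also shows, as an aside, that subset() accepts a condition on a column, which is where it tends to be most convenient.

```r
ds <- data.frame(ID = 1:5, SIDD = c(10, 20, 30, 40, 50))

index <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
by_subset  <- subset(ds, subset = index)
by_bracket <- ds[index, ]   # square brackets accept the same logical vector

by_subset$ID    # 2 4 5
by_bracket$ID   # 2 4 5

# A condition on a column produces the logical index automatically.
by_cond <- subset(ds, SIDD > 25)
by_cond$ID      # 3 4 5
```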

Square Brackets Extraction Operator

An individual scalar entry within a data frame can be returned by using the square bracket operators, with numbers on both sides of the comma.

To return the scalar value in the mth row and nth column of a data frame ds, issue the command

ds[m,n]

To return the scalar value in the mth row of the data frame ds, in the column named SIDD, issue the command

ds[m,"SIDD"]

Assignment with [] and $

The square brackets and dollar sign can also be used to assign values within a data frame. If the column SIDD in the data frame ds contains numerical data, we can multiply the 5th entry in the SIDD column by two by issuing the command

ds[5,"SIDD"] <- 2 * ds[5,"SIDD"]

We could create a new column (or replace the values within the column) named TWICE_SIDD in the data frame ds, and fill it with values twice those in the column SIDD, by issuing the command

ds$TWICE_SIDD <- 2 * ds$SIDD
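Both assignments can be tried on a made-up data frame, as sketched below. The values are assumptions for illustration.

```r
ds <- data.frame(ID = 1:5, SIDD = c(10, 20, 30, 40, 50))

ds[5, "SIDD"]                        # 50
ds[5, "SIDD"] <- 2 * ds[5, "SIDD"]   # double the 5th entry
ds[5, "SIDD"]                        # 100

ds$TWICE_SIDD <- 2 * ds$SIDD         # creates the new column
names(ds)                            # "ID" "SIDD" "TWICE_SIDD"
ds[5, "TWICE_SIDD"]                  # 200
```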

Dimensions

Commands to return the dimensions of a data frame ds are

dim(ds)
nrow(ds)
ncol(ds)

dim(ds) returns a vector of length two containing the number of rows in position 1 and the number of columns in position 2.

The command to return the length of a vector v is

length(v)
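A quick sketch of these commands on a made-up data frame with 4 rows and 3 columns:

```r
ds <- data.frame(ID = 1:4,
                 SIDD = c(10, 20, 30, 40),
                 OK = c(TRUE, TRUE, FALSE, TRUE))

dim(ds)    # 4 3
nrow(ds)   # 4
ncol(ds)   # 3

length(ds$SIDD)   # 4, the length of one column vector
length(ds)        # 3, for a data frame length() counts the columns
```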

Factors

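In versions of R before 4.0.0, character vector columns in data frames were stored as factors by default. Since R 4.0.0 they remain character vectors unless stringsAsFactors=TRUE is specified or they are converted with factor(). In R, a factor is an indexed vector.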

To factor a vector, R identifies the unique entries in the vector and makes them the levels of the factor. Each vector entry is then indexed by an integer to one of the factor levels. This saves memory when the entries in a vector are not all unique.

There are several functions to handle factors. Refer to the R Documentation or Help pages about factors.
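The indexing described above can be seen by converting a small character vector to a factor. The vector contents are made up for this example.

```r
v <- c("red", "blue", "red", "green", "blue", "red")
f <- factor(v)

levels(f)        # "blue" "green" "red", the sorted unique entries
as.integer(f)    # 3 1 3 2 1 3, each entry's index into the levels
as.character(f)  # recovers the original strings
```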

Connections and Line-by-Line Text Input/Output

Connections

In some cases, it is preferable to import or export data line-by-line.

Line-by-line data input/output reduces R’s memory usage and is useful when dealing with very large delimited text datasets.

Line-by-line text input/output can be useful for reading and writing log files.

The first step in reading line-by-line is opening a file connection.

Connections for Input

R can open a text file connection conn for input using the command

conn <- file(filename, open="rt")

If the specified file exists and is accessible, then a connection is created and opened for text reading.

Example:

conn <- file("C:/data/in.txt", open="rt")

("rt" indicates “read text”)

Line-by-Line Input

Once a text file input connection is open, we can use one of R’s line-by-line text input functions:

readLines(conn, n)

scan(…)

The scan(…) function is useful for importing delimited data files (e.g. CSV) line by line. The scan(…) function has many arguments. Refer to its lengthy R Documentation page for details.

The readLines(…) function is simpler and is useful for reading unstructured lines of text.

Line-by-Line Input

To read one line of text from a file into the character variable str (a character vector of length one), we could use the following series of commands

conn <- file("C:/data/in.txt", open="rt")
str <- readLines(conn, n=1)
close(conn)

The close(conn) command closes the connection, leaving the file intact, and leaving str in the R workspace.
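Because the connection stays open between calls, repeated calls to readLines(conn, n=1) walk through the file one line at a time. The sketch below counts the lines of a small temporary file this way. The file contents and the names in_path and n_lines are assumptions for illustration.

```r
# Create a small text file so the example is self-contained.
in_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), in_path)

conn <- file(in_path, open = "rt")
n_lines <- 0
repeat {
  str <- readLines(conn, n = 1)   # read the next line only
  if (length(str) == 0) break     # an empty result signals end of file
  n_lines <- n_lines + 1
}
close(conn)

n_lines   # 3

unlink(in_path)   # remove the temporary file
```

Only one line is held in memory at a time, which is the point of reading large files this way.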

Connections for Output

R can create a text file connection conn for output using the command

conn <- file(filename, open="wt")

If the file does not exist, it is created. If the file already exists, its contents are erased!

Example:

conn <- file("C:/data/out.txt", open="wt")

("wt" indicates “write text”)

Output to a Connection

Once a text file output connection is open, we can write text to the connection by making one or more calls to the R function

write("text to write", file=conn, append=TRUE)

Once finished writing text to the connection, close it using the command

close(conn)
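Putting these steps together, the sketch below opens an output connection to a temporary file, writes two lines, closes the connection, and reads the file back to confirm the result. The file name out_path is an assumption for this example.

```r
out_path <- tempfile(fileext = ".txt")

conn <- file(out_path, open = "wt")   # creates (or erases) the file
write("first line", file = conn)      # each call adds one line
write("second line", file = conn)
close(conn)

written <- readLines(out_path)
written   # "first line" "second line"

unlink(out_path)   # remove the temporary file
```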

Output to a File

R can also write directly to a file without creating a connection. In this example, we retain the contents of an existing text file and append new text.

To write the contents of the character string str to a file, issue the command

write(str, file=filename, append=TRUE)

Example:

str <- "some text to output \nline 2"
write(str, file="C:/data/out.txt", append=TRUE)

Output to a File

If the specified file does not exist, the write() command will create it.

Be sure to use the append=TRUE option when appending to an existing text file, or the file’s contents will be cleared!

There is no need to use the close() command after writing to a file without using a connection, because no persistent connection has been opened.

Use the newline character \n to create line breaks in text output.
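The effect of append=TRUE, and of leaving it out, can be seen in the sketch below, which writes to a temporary file instead of C:/data/out.txt. The names out_path, after_append, and after_overwrite are assumptions for illustration.

```r
out_path <- tempfile(fileext = ".txt")

write("line 1", file = out_path)                          # creates the file
write("line 2\nline 3", file = out_path, append = TRUE)   # \n starts a new line
after_append <- readLines(out_path)
after_append      # "line 1" "line 2" "line 3"

write("replaced", file = out_path)   # no append=TRUE: old contents are cleared
after_overwrite <- readLines(out_path)
after_overwrite   # "replaced"

unlink(out_path)   # remove the temporary file
```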

Gzip Connections

R provides facilities for line-by-line reading and writing of files compressed by the gzip utility.

To create a connection to a gzip file for reading, issue the command

conn <- gzfile(filename, open="rt")

To create a connection to a gzip file for writing, issue the command

conn <- gzfile(filename, open="wt")

The readLines(), write(), and close() functions can be used in the same way as with text file connections.
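A round trip through a gzip file might look like the sketch below, which writes two lines to a compressed temporary file and reads them back. The file name gz_path and the text written are assumptions for this example.

```r
gz_path <- tempfile(fileext = ".gz")

# Write two lines through a gzip output connection.
conn <- gzfile(gz_path, open = "wt")
write("compressed line 1", file = conn)
write("compressed line 2", file = conn)
close(conn)

# Read the lines back through a gzip input connection.
conn <- gzfile(gz_path, open = "rt")
lines_back <- readLines(conn)
close(conn)

lines_back   # "compressed line 1" "compressed line 2"

unlink(gz_path)   # remove the temporary file
```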
