Getting started with R
Sebastiano Manzan
ECO 4051 | Spring 2018
Why R?
I R is becoming increasingly popular in economics and finance
I It is open source, simple to use, and has numerous packages (12,126 as of today) contributed by a large community of users, covering every aspect of statistical modeling and data analysis
I If you can have an excellent free product, why should you pay for an excellent expensive product (e.g., Matlab, SAS)?
I Learning a programming language is a useful skill in the current labor market, where companies are increasingly interested in extracting information (intelligence) from the large datasets they collect about their business
I Trends in the industry (just two examples):
I Microsoft bought Revolution R and renamed it Microsoft R (a version of R that is optimized to work on multiple cores)
I IBM bought SPSS and many data/analytics providers (e.g., the Weather Channel), and sponsors Cognitive Class.ai, which offers free online courses about data science and machine learning
Outline of the course
1. Getting started with R
2. Linear Regression Model
3. Time series models
4. Volatility modeling
5. High-frequency data
6. Measuring Financial Risk
Let's get started with RStudio
[Figure 2: screenshot of RStudio]
How R works
I R creates and works with objects that contain data
I The object/data can have different structures, such as:
I data frame: a table where each column represents a variable and each row a different observation (a different time period or unit); a variable can be numerical or a string (similar to an Excel spreadsheet)
I matrix: same as a data frame, but all variables/columns have to be of the same type (typically all numbers)
I list: an object of objects; each element of the list can be, e.g., a data frame, a matrix, or a vector (similar to a set of Excel spreadsheets)
I Function: in R we can create functions that take a set of arguments and perform a set of operations on a data object; e.g., mean(x, na.rm=T)
I Package: a group of functions with a specific purpose (e.g., ggplot2)
I install a package: install.packages("ggplot2") (only done once)
I use the package: library(ggplot2) or require(ggplot2)
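A quick sketch of these structures (all values below are made up for illustration):

```r
# data frame: each column is a variable, each row an observation
df <- data.frame(ticker = c("MMM", "ABT"), price = c(165.4, 44.2))

# matrix: all entries must be of the same type (here, numeric)
m <- matrix(1:6, nrow = 2, ncol = 3)

# list: an object of objects; elements can be of any type
mylist <- list(data = df, mat = m, vec = c(1, 2, 3))

# function: takes arguments and performs operations on a data object
demean <- function(x, na.rm = TRUE) x - mean(x, na.rm = na.rm)
demean(c(1, 2, 3))  # -1 0 1
```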
Loading data in R
I It is convenient to start an R session by setting the working directory where the data/files are stored; for example:
I setwd('/Users/username/Baruch/ECO4051/') on Mac/Unix
I setwd('c:/Baruch/ECO4051/') on Windows
I There are two ways to load a dataset in R:
1. import the data from a local file
2. import the data from an online source (e.g., Yahoo Finance, FRED, Google Finance, Quandl)
Base function read.csv()
I You can load a file from RStudio via Tools -> Import Dataset, and then you are given the option From Text File or From Web URL
I Otherwise, you can type a few lines of code (table from Wikipedia):
splist <- read.csv("List_SP500.csv")
head(splist, 10)
   Ticker.symbol                   Security  Address.of.Headquarters Date.first.added
1            MMM                 3M Company      St. Paul, Minnesota
2            ABT        Abbott Laboratories  North Chicago, Illinois       1964-03-31
3           ABBV                AbbVie Inc.  North Chicago, Illinois       2012-12-31
4            ACN              Accenture plc          Dublin, Ireland       2011-07-06
5           ATVI        Activision Blizzard Santa Monica, California       2015-08-31
6            AYI          Acuity Brands Inc         Atlanta, Georgia       2016-05-03
7           ADBE          Adobe Systems Inc     San Jose, California       1997-05-05
8            AMD Advanced Micro Devices Inc    Sunnyvale, California       2017-03-20
9            AAP         Advance Auto Parts        Roanoke, Virginia       2015-07-09
10           AES                   AES Corp      Arlington, Virginia
I The commands head(x, n) and tail(x, n) show the first and last n observations
Data types
I The str() command can be used to evaluate the object structure and the data types:
str(splist)
'data.frame': 505 obs. of 4 variables:
 $ Ticker.symbol          : Factor w/ 505 levels "A","AAL","AAP",..: 314 7 5 8 52 58 9 33 3 17 ...
 $ Security               : Factor w/ 505 levels "3M Company","A.O. Smith Corp",..: 1 3 4 5 6 7 8 10 9 11 ...
 $ Address.of.Headquarters: Factor w/ 256 levels "Akron, Ohio",..: 222 159 159 66 210 8 204 226 195 5 ...
 $ Date.first.added       : Factor w/ 303 levels "","1964-03-31",..: 1 2 213 195 252 273 79 292 249 1 ...
I Each variable in the data frame splist has a type that can be:
I numeric: (or double) is used for decimal values
I integer: for integer values
I character: for strings of characters
I Date: for dates
I factor: represents a variable (either numeric, integer, or character) that categorizes the values into a small (relative to the sample size) set of categories (or levels)
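These types can be inspected directly with class(); a small made-up example:

```r
x <- c(1.5, 2.7)                          # numeric (double)
n <- 3L                                   # integer (note the L suffix)
s <- "AAPL"                               # character
d <- as.Date("1964-03-31")                # Date
f <- factor(c("NYSE", "NASDAQ", "NYSE"))  # factor with two levels

class(d)    # "Date"
levels(f)   # the set of categories: "NASDAQ" "NYSE"
```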
I The read.csv() function has the annoying feature that any string is interpreted as a factor
I This can be switched off by adding the argument stringsAsFactors = FALSE
splist <- read.csv("List_SP500.csv", stringsAsFactors = FALSE)
str(splist)
'data.frame': 505 obs. of 4 variables:
 $ Ticker.symbol          : chr "MMM" "ABT" "ABBV" "ACN" ...
 $ Security               : chr "3M Company" "Abbott Laboratories" "AbbVie Inc." "Accenture plc" ...
 $ Address.of.Headquarters: chr "St. Paul, Minnesota" "North Chicago, Illinois" "North Chicago, Illinois" "Dublin, Ireland" ...
 $ Date.first.added       : chr "" "1964-03-31" "2012-12-31" "2011-07-06" ...
I The ticker symbol, security name, and address are all correctly interpreted as chr
I The Date.first.added is also imported as a string, but we would like to define it as type Date
I The code below defines the column/variable Date.first.added as a date with the command as.Date()
splist$Date.first.added <- as.Date(splist$Date.first.added, format="%Y-%m-%d")
str(splist)
'data.frame': 505 obs. of 4 variables:
 $ Ticker.symbol          : chr "MMM" "ABT" "ABBV" "ACN" ...
 $ Security               : chr "3M Company" "Abbott Laboratories" "AbbVie Inc." "Accenture plc" ...
 $ Address.of.Headquarters: chr "St. Paul, Minnesota" "North Chicago, Illinois" "North Chicago, Illinois" "Dublin, Ireland" ...
 $ Date.first.added       : Date, format: NA "1964-03-31" "2012-12-31" "2011-07-06" ...
I The $ sign is used to extract a variable/column from a data frame; splist$Date.first.added extracts Date.first.added from the data frame object splist
I The as.Date() function converts splist$Date.first.added from character to Date; the role of the argument format="%Y-%m-%d" is to specify the format of the date being read
read_csv() from readr package
I In addition to the base read.csv() function, there are other packages that provide functions to read data
I There are two problems with the read.csv() function:
I type guessing (in particular dates)
I reading speed
I The function read_csv() from package readr tries to solve both problems (we will talk about speed later):
library(readr)
splist <- read_csv("List_SP500.csv")
str(splist, max.level=1)
Classes 'tbl_df', 'tbl' and 'data.frame': 505 obs. of 4 variables:
 $ Ticker symbol          : chr "MMM" "ABT" "ABBV" "ACN" ...
 $ Security               : chr "3M Company" "Abbott Laboratories" "AbbVie Inc." "Accenture plc" ...
 $ Address of Headquarters: chr "St. Paul, Minnesota" "North Chicago, Illinois" "North Chicago, Illinois" "Dublin, Ireland" ...
 $ Date first added       : Date, format: NA "1964-03-31" "2012-12-31" "2011-07-06" ...
 - attr(*, "spec")=List of 2
  ..- attr(*, "class")= chr "col_spec"
I Notice:
I the classes tbl (tibble) and tbl_df are a type of data frame specific to this package
I Date first added is defined as a Date, which saves us a line of code
I The file GSPC.csv contains daily data for the S&P 500 Index from January 1985, downloaded from Yahoo Finance
I Below is a comparison of read.csv() and read_csv()
index <- read.csv("GSPC.csv", stringsAsFactors = FALSE)
'data.frame': 8237 obs. of 7 variables:
 $ Date         : chr "1985-01-02" "1985-01-03" "1985-01-04" "1985-01-07" ...
 $ GSPC.Open    : num 167 165 165 164 164 ...
 $ GSPC.High    : num 167 166 165 165 165 ...
 $ GSPC.Low     : num 165 164 163 164 164 ...
 $ GSPC.Close   : num 165 165 164 164 164 ...
 $ GSPC.Volume  : num 67820000 88880000 77480000 86190000 92110000 ...
 $ GSPC.Adjusted: num 165 165 164 164 164 ...
index <- read_csv("GSPC.csv")
Classes 'tbl_df', 'tbl' and 'data.frame': 8237 obs. of 7 variables:
 $ Date         : Date, format: "1985-01-02" "1985-01-03" "1985-01-04" "1985-01-07" ...
 $ GSPC.Open    : num 167 165 165 164 164 ...
 $ GSPC.High    : num 167 166 165 165 165 ...
 $ GSPC.Low     : num 165 164 163 164 164 ...
 $ GSPC.Close   : num 165 165 164 164 164 ...
 $ GSPC.Volume  : num 67820000 88880000 77480000 86190000 92110000 ...
 $ GSPC.Adjusted: num 165 165 164 164 164 ...
 - attr(*, "spec")=List of 2
  ..- attr(*, "class")= chr "col_spec"
Saving data files
I In addition to reading/importing files, we might also need to save data files
I This can be done with the base function write.csv() (see help(write.csv) for the arguments)
index <- read_csv("GSPC.csv")
write.csv(index, file = "myfile.csv", row.names = FALSE)
I The index object is saved to a file called myfile.csv in the working directory
Plotting the data . . .
I Visualization is an essential part of data analysis
I It is useful to guide the analysis and to communicate our results
I It is typically easier to process and understand a graph than a table of numbers
I The base function in R to plot data is plot(), which takes as arguments:
I x, y: the variables to plot on the x-axis and y-axis
I type: p for points, l for lines, b for both (and more)
I xlim, ylim: the range of the axes
I xlab, ylab: the labels of the axes
I main: string to use for the title
I col: color of the points and/or line
I pch: type of point to use
I The code below produces a time series plot of the S&P 500 Index:
I The column index$Date is defined as class Date and used as the x-axis
I The column index$GSPC.Adjusted is used as the y-axis
I The left plot uses the default settings; the plot on the right has been customized
# LEFT PLOT
plot(index$Date, index$GSPC.Adjusted)
# RIGHT PLOT
plot(index$Date, index$GSPC.Adjusted, type="l", xlab="", ylab="S&P 500 Index",
     xaxt="n", yaxt="n")
ticks <- seq(index$Date[1], index$Date[nrow(index)], by="year")
axis(1, at=ticks, labels=ticks, cex.axis=0.9, col="orange", col.axis="blue")
axis(2, at=seq(0, 2000, 500), labels=seq(0, 2000, 500), col.ticks=3, cex.axis=0.75, col.axis="purple")
axis(4, at=seq(0, 2000, 500), labels=seq(0, 2000, 500), col.ticks=3, cex.axis=0.75, col.axis="purple")
[Figure: two time series plots of index$GSPC.Adjusted against index$Date, 1985-2017 — default settings (left) and the customized version labeled "S&P 500 Index" with yearly ticks (right)]
Time series objects
I A variable that is observed over time is called a time series (e.g., stock prices, real GDP, inflation)
I There are several packages that provide infrastructure to define an object as a time series object
I I will mostly use the xts package (which underlies the quantmod package for quantitative finance; other time series packages are ts and zoo)
I To define an object as a time series we use the command xts(), which takes two arguments:
I a data frame
I a vector of dates (of class Date)
library(xts)
index.xts <- xts(subset(index, select=-Date), order.by=index$Date)
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
1985-01-02    167.20    167.20   165.19     165.37    67820000        165.37
1985-01-03    165.37    166.11   164.38     164.57    88880000        164.57
1985-01-04    164.55    164.55   163.36     163.68    77480000        163.68
1985-01-07    163.68    164.71   163.68     164.24    86190000        164.24
I The xts package provides functions to extract information from a time series object:
start(index.xts)       # start date
end(index.xts)         # end date
periodicity(index.xts) # periodicity/frequency (daily, weekly, monthly)
[1] "1985-01-02"
[1] "2017-09-01"
Daily periodicity from 1985-01-02 to 2017-09-01
I There are also functions to aggregate the observations from a high frequency (e.g., daily) to a lower frequency (e.g., weekly/monthly/quarterly)
I By default, to.weekly() aggregates each interval into an OHLC bar: the open of the first day of the interval, the high/low over the interval, and the close of the last day
index.weekly <- to.weekly(index.xts)
           index.xts.Open index.xts.High index.xts.Low index.xts.Close index.xts.Volume index.xts.Adjusted
1985-01-04         167.20         167.20        163.36          163.68        234180000             163.68
1985-01-11         163.68         168.72        163.68          167.91        509830000             167.91
1985-01-18         167.91         171.94        167.58          171.32        634000000             171.32
1985-01-25         171.32         178.16        171.31          177.35        749100000             177.35
I The functions apply.weekly() and apply.monthly() are used when the goal is to apply a function to each week/month in the sample
I In the examples below I apply these functions to subsample the first() and last() day of the week (notice that first() keeps the full row of the first day, unlike to.weekly(), which builds OHLC bars) and to calculate the mean() of the week
index.weekly <- apply.weekly(index.xts, "first")
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
1985-01-04    167.20    167.20   165.19     165.37    67820000        165.37
1985-01-11    163.68    164.71   163.68     164.24    86190000        164.24
index.weekly <- apply.weekly(index.xts, "last")
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
1985-01-04    164.55    164.55   163.36     163.68    77480000        163.68
1985-01-11    168.31    168.72   167.58     167.91   107600000        167.91
index.weekly <- apply.weekly(index.xts, "mean")
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
1985-01-04    165.71    165.95   164.31     164.54    78060000        164.54
1985-01-11    165.08    166.38   164.83     165.93   101966000        165.93
Plotting time series data
I Once the object is defined as xts, plotting is executed by plot.xts(), which has the advantage that:
I the x-axis is by default defined to be time
I dates are rendered more easily through the grid
I Using plot() on an xts object automatically calls plot.xts()
library(quantmod)  # for getSymbols() and Ad()
GSPC <- getSymbols("^GSPC", from="1985-01-01", auto.assign=FALSE)
plot(Ad(GSPC))
[Figure: plot of Ad(GSPC) — S&P 500 adjusted close, 1985-2017]
Subsetting
I The xts package provides its own syntax to subset the time series object
I Below are some examples:
par(mfrow=c(2,3)) # organize the plots in 2 rows and 3 columns
plot(index.xts['2007'])
plot(index.xts['2007/2009'])
plot(index.xts['/2007'])
plot(index.xts['2007/'])
plot(index.xts['2007-03-21/2008-02-12'])
plot(index.xts[.indexwday(index.xts)==3])
[Figure: six panels showing index.xts["2007"], index.xts["2007/2009"], index.xts["/2007"], index.xts["2007/"], index.xts["2007-03-21/2008-02-12"], and index.xts[.indexwday(index.xts) == 3]]
getSymbols() from the quantmod package
I There are several packages that provide functions to download economic and financial data by only specifying the ticker and time period (and, in some functions, the frequency)
I I will discuss only the getSymbols() function from package quantmod, which will be used in this class
I Features of getSymbols():
I Sources: Yahoo Finance, Google Finance, OANDA (fx rates), FRED (argument src)
I Download multiple tickers in one call
I Select the time period (with arguments from and to)
I By default the output is an xts object for each ticker specified
I Yahoo Finance: downloads open, high, low, close, volume, and adjusted close at the daily frequency
I You can convert to weekly or monthly using to.weekly()/to.monthly() or apply.weekly()/apply.monthly()
getSymbols() with one ticker
I By default, the function creates an xts object with the name of the ticker (minus the ^ part if you are downloading an index)
I When downloading only one ticker, setting auto.assign=FALSE allows you to assign the output to an object that you name (in the example below, data)
library(quantmod)
getSymbols("^GSPC", src = "yahoo", from = "1990-01-01")
[1] "GSPC"
tail(GSPC, 2)
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
2018-01-23    2835.1    2842.2   2830.6     2839.1  3519650000        2839.1
2018-01-24    2845.4    2853.0   2824.8     2837.5  4014070000        2837.5
data <- getSymbols("^GSPC", src = "yahoo", from = "1990-01-01", auto.assign = FALSE)
tail(data, 2)
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
2018-01-23    2835.1    2842.2   2830.6     2839.1  3519650000        2839.1
2018-01-24    2845.4    2853.0   2824.8     2837.5  4014070000        2837.5
Multiple tickers
I A vector of symbols can be passed to getSymbols() to download data for multiple assets
I The function will create an xts object for each ticker, named after the ticker (the auto.assign option does not work for more than one ticker)
I When you download more than 5 symbols you will see the message pausing 1 second between requests for more than 5 symbols
library(quantmod)
getSymbols(c("^GSPC","^DJI"), src="yahoo", from="1990-01-01")
periodicity(GSPC)
periodicity(DJI)
[1] "GSPC" "DJI"
Daily periodicity from 1990-01-02 to 2018-01-24
Daily periodicity from 1990-01-02 to 2018-01-24
I getSymbols() also allows you to specify an environment (argument env)
I Think of the environment as a folder in the R global environment where the objects will be stored
I Steps:
I create a new environment with the new.env() command (called myenv below)
I call the getSymbols() function and set the env= argument to the new environment you created
I the ls() command below lists the objects in the new environment myenv
splist <- read.csv("List_SP500.csv", stringsAsFactors = FALSE)
myenv <- new.env()
getSymbols(splist$Ticker.symbol[1:10], env=myenv, from="2010-01-01", src="yahoo")
[1] "MMM" "ABT" "ABBV" "ACN" "ATVI" "AYI" "ADBE" "AMD" "AAP" "AES"
ls(myenv)
[1] "AAP" "ABBV" "ABT" "ACN" "ADBE" "AES" "AMD" "ATVI" "AYI" "MMM"
quantmod functionalities
I The object created contains all the information, but for our analysis we might need only some of the columns
I The package provides functions to extract the open price Op(), the closing price Cl(), the highest intra-day price Hi(), the lowest Lo(), the volume Vo(), and the adjusted closing price Ad()
I OpCl() calculates the open-to-close daily return, ClCl() the close-to-close return, and LoHi() the low-to-high difference (also called the intra-day range)
data.new <- merge(Ad(GSPC),Ad(DJI))
           GSPC.Adjusted DJI.Adjusted
1990-01-02        359.69       2810.1
1990-01-03        358.76       2809.7
1990-01-04        355.67       2796.1
Oanda
I Daily exchange rates for a wide range of currency pairs
I Limit of 2000 days per request
I The command oanda.currencies gives you the symbols for 191 currencies
getSymbols(c("USD/EUR", "USD/JPY"), src="oanda")
[1] "USDEUR" "USDJPY"
par(mfrow=c(1,2))
plot(USDEUR)
plot(USDJPY)
[Figure: two panels, USDEUR and USDJPY daily exchange rates, Jul 2017-Jan 2018]
FRED
I Federal Reserve Economic Data (FRED) can be used to download macroeconomic time series for the US economy, as well as international series
I Visit FRED to find the symbol of the variable(s) you are interested in downloading
I Below are some examples:
library(quantmod)
macrodata <- getSymbols(c('UNRATE','CPIAUCSL','GDPC1'), src="FRED")
macrodata <- merge(UNRATE, CPIAUCSL, GDPC1)
par(mfrow=c(1,3))
plot(UNRATE); plot(CPIAUCSL); plot(GDPC1)
[Figure: three panels — UNRATE (from 1948), CPIAUCSL (from 1947), and GDPC1 (from 1947)]
I Exchange rates are also available in FRED (no restriction on the time period)
# DEXUSEU: U.S. Dollars to One Euro
# DEXJPUS: Japanese Yen to One U.S. Dollar
macrodata <- getSymbols(c('DEXUSEU','DEXJPUS'), src="FRED", from="1975-01-01")
par(mfrow=c(1,2))
plot(DEXUSEU)
plot(DEXJPUS)
[Figure: two panels — DEXUSEU (from 1999) and DEXJPUS (from 1971)]
Quandl
I Quandl works as an aggregator of open and subscription databases
I The macroeconomic variables retrieved from FRED can also be obtained from Quandl:
library(Quandl)
macrodata <- Quandl(c("FRED/UNRATE", "FRED/CPIAUCSL", "FRED/GDPC1"),
                    start_date="1950-01-02", type="xts")
head(macrodata)
           FRED.UNRATE - Value FRED.CPIAUCSL - Value FRED.GDPC1 - Value
1950-02-01                 6.4                 23.61                 NA
1950-03-01                 6.3                 23.64                 NA
1950-04-01                 5.8                 23.65             2147.6
1950-05-01                 5.5                 23.77                 NA
1950-06-01                 5.4                 23.88                 NA
1950-07-01                 5.0                 24.07             2230.4
I Quandl has many more datasets, e.g. commodity spot and futures prices
oil.spot <- Quandl("COM/WLD_CRUDE_WTI", type="xts")
coffee.spot <- Quandl("COM/COFFEE_BRZL", type="xts")
sp.futures <- Quandl("CHRIS/CME_SP1", type="xts")
gold.futures <- Quandl("CHRIS/CME_GC3", type="xts")
par(mfrow=c(1,4))
plot(oil.spot); plot(coffee.spot); plot(sp.futures$Settle); plot(gold.futures$Settle)
[Figure: four panels — oil.spot, coffee.spot, sp.futures$Settle, and gold.futures$Settle]
Reading large files
I Files imported in R are stored in the memory of the program
I The physical limit to the file size that can be imported is determined by the RAM of your machine (2, 4, 6 GB)
I Reading large files can be time consuming when using the base functions
I Two packages are available to help with this task:
I readr (function read_csv())
I data.table (function fread())
I I will perform a speed comparison of the three functions using a common benchmark
I The benchmark for the comparison is obtained from the Center for Research in Security Prices (CRSP) at the University of Chicago. The variables in the dataset are:
I PERMNO: identification number for each company
I date: date in format 2015/12/31
I EXCHCD: exchange code
I TICKER: company ticker
I COMNAM: company name
I CUSIP: another identification number for the security
I DLRET: delisting return
I PRC: price
I RET: return
I SHROUT: shares outstanding
I ALTPRC: alternative price
I The observations are all companies listed on the NYSE, NASDAQ, and AMEX from January 1985 until December 2016 at the monthly frequency, for a total of 3,627,236 observations and 16 variables. The size of the file is 328 MB
read_csv() from readr package
I The command Sys.time() reads the current time, which is assigned to start.time
I The time to perform the operation is calculated as the difference between Sys.time() and start.time
start.time <- Sys.time()
crsp <- read.csv("crsp_eco4051_jan2017.csv", stringsAsFactors = FALSE)
end.csv <- Sys.time() - start.time
Time difference of 39.204 secs
I and for the read_csv() function?
library(readr)
start.time <- Sys.time()
crsp <- read_csv("crsp_eco4051_jan2017.csv")
end_csv <- Sys.time() - start.time
Time difference of 5.5403 secs
I read_csv() is 7.1 times faster than read.csv()
fread() from data.table package
I Another function to read data fast is fread()
I The arguments:
I data.table: whether the output will be a regular data frame or a data.table frame (TRUE/FALSE)
I showProgress: whether partial info about the percentage loaded should be printed (TRUE/FALSE)
start.time <- Sys.time()
crsp <- data.table::fread("crsp_eco4051_jan2017.csv",
                          data.table=FALSE, showProgress = FALSE)
end.fread <- Sys.time() - start.time
Time difference of 3.1622 secs
I fread() is 12 times faster than read.csv() and 1.8 times faster than read_csv()
Create returns
I Very often we need to create new variables that are transformations of existing variables
I One example is to calculate the return of an asset, that is:
1. Simple return: Rt = (Pt − Pt−1)/Pt−1
2. Logarithmic return: rt = log(Pt) − log(Pt−1)
I In macro this transformation is typically called the growth rate
I The transformation is easily done in R using the lag() and diff() functions
GSPC <- getSymbols("^GSPC", from="1990-01-01", auto.assign = FALSE)
GSPC$ret.simple <- 100 * (Ad(GSPC) - lag(Ad(GSPC), 1)) / lag(Ad(GSPC), 1)
GSPC$ret.log <- 100 * (log(Ad(GSPC)) - lag(log(Ad(GSPC)), 1))
# equivalently, using diff():
GSPC$ret.simple <- 100 * diff(Ad(GSPC)) / lag(Ad(GSPC), 1)
GSPC$ret.log <- 100 * diff(log(Ad(GSPC)))
head(GSPC)
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted ret.simple  ret.log
1990-01-02    353.40    359.69   351.98     359.69   162070000        359.69         NA       NA
1990-01-03    359.69    360.59   357.89     358.76   192330000        358.76   -0.25855 -0.25889
1990-01-04    358.76    358.76   352.89     355.67   177000000        355.67   -0.86130 -0.86503
1990-01-05    355.67    355.67   351.35     352.20   158530000        352.20   -0.97562 -0.98041
1990-01-08    352.20    354.24   350.54     353.79   140110000        353.79    0.45145  0.45043
1990-01-09    353.83    354.17   349.61     349.62   155210000        349.62   -1.17867 -1.18567
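The two definitions can be checked on a made-up price vector (no download needed); for small price changes the simple and log returns are very close:

```r
p <- c(100, 101, 99, 102)                    # hypothetical prices
ret.simple <- 100 * diff(p) / p[-length(p)]  # (Pt - Pt-1)/Pt-1, in percent
ret.log    <- 100 * diff(log(p))             # log(Pt) - log(Pt-1), in percent
round(ret.simple, 3)  # 1.000 -1.980  3.030
round(ret.log, 3)     # 0.995 -2.000  2.985
```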
I If we want to transform all the columns of a data frame or xts object, we can simply perform the operation on the object rather than on each variable
# "^GSPC" = S&P 500 Index, "^N225" = Nikkei 225, "^STOXX50E" = EURO STOXX 50
data <- getSymbols(c("^GSPC", "^N225", "^STOXX50E"), from="2000-01-01")
price <- merge(Ad(GSPC), Ad(N225), Ad(STOXX50E))
ret <- 100 * diff(log(price))
tail(ret, 5)
           GSPC.Adjusted N225.Adjusted STOXX50E.Adjusted
2018-01-19      0.437565      0.187892                NA
2018-01-22      0.803436      0.034728           0.44324
2018-01-23      0.217200      1.284195           0.19107
2018-01-24     -0.056013     -0.763018          -0.79476
2018-01-25            NA     -1.139636                NA
Elegant graphics: ggplot2 package
I The base plotting functions are easy to use and convenient for quick plotting
I However, they lack elegance and it is difficult to produce high-quality graphics
I The package ggplot2 offers an alternative set of functions to make graphs
I There are two ways of producing plots with ggplot2:
1. qplot() is a wrapper function similar to plot() that uses the underlying ggplot2 plotting functions
2. Using the grammar of graphics, which is composed of:
I ggplot(): creates the graph and specifies the data frame that contains the variables to plot
I geom_xxx(): the type of plot that is needed; point, line, histogram, boxplot, etc.
I aes(): the x and y variables to plot
I theme(): the overall look of the plot; theme_bw(), theme_classic(), theme_dark(), etc.
I ggplot2 does not recognize the time series properties of xts objects, so we have to specify the x-axis
I The time(GSPC) command is used to extract the date associated with each observation
GSPC <- getSymbols("^GSPC", from="1985-01-01", auto.assign=FALSE)
names(GSPC) <- sub('GSPC.', "", names(GSPC))
library(ggplot2)
qplot(time(GSPC), GSPC$Close, geom="line")
[Figure: qplot line chart of GSPC$Close against time(GSPC)]
I ggplot2 interacts best with data frames
I We can extract a data frame from the xts object by:
I creating a new variable that represents the Date
I using coredata() to extract the data from the xts object
GSPC.df <- data.frame(Date = time(GSPC), coredata(GSPC))
qplot(Date, Close, data = GSPC.df, geom = "line")
[Figure: qplot line chart of Close against Date]
I We can produce the same graph using the ggplot2 grammar, as in the example below
I Notice that we can assign a ggplot to an object (called myplot) that can be used later and altered by changing only specific aspects
myplot <- ggplot(GSPC.df) + geom_line(aes(x = Date, y = Close))
myplot
[Figure: ggplot line chart of Close against Date]
I The plots can be customized with themes, line colors and types, label names, etc.
I par(mfrow=c(2,2)) does not work with ggplot2; we should instead use the grid.arrange() function from package gridExtra
plot1 <- ggplot(GSPC.df, aes(Date, Adjusted)) + geom_line(color="darkgreen")
plot2 <- plot1 + theme_bw()
plot3 <- plot2 + theme_classic() + labs(x="", y="Index", title="S&P 500")
plot4 <- plot3 + geom_line(color="darkorange") + geom_smooth(method="lm") +
  theme_dark() + labs(subtitle="Period: 1985/2016", caption="Source: Yahoo")
library(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, ncol=2)
[Figure: four versions of the S&P 500 plot — default theme, theme_bw(), theme_classic() with title "S&P 500", and theme_dark() with subtitle "Period: 1985/2016" and caption "Source: Yahoo"]
I A scatter plot of two variables can be easily produced in ggplot2
library(lubridate)  # year() is in the lubridate package
data <- getSymbols(c("^GSPC", "^N225"), from="1990-01-01")
price <- merge(Ad(to.monthly(GSPC)), Ad(to.monthly(N225)))
ret <- 100 * diff(log(price))
GN.df <- data.frame(Date=time(price), Year = year(time(price)), coredata(merge(price, ret)))
names(GN.df) <- c("Date", "Year", "SP", "NIK", "SPret", "NIKret")
plot1 <- ggplot(GN.df, aes(NIKret, SPret)) + geom_point() +
  geom_vline(xintercept = 0) + geom_hline(yintercept = 0)
plot2 <- plot1 + geom_smooth(method="lm", se=FALSE) + theme_bw() +
  labs(x="NIKKEI", y="SP500")
grid.arrange(plot1, plot2, ncol=2)
[Figure: two scatter plots of SPret against NIKret — raw (left) and with a regression line, theme_bw(), and axis labels NIKKEI/SP500 (right)]
I The strength of ggplot2 is to make it easier to produce sophisticated graphics
I For example: if we want the dots in the scatter plot to depend on a variable (e.g., Year), this can be done easily by adding the argument color=Year in the aesthetics
GN.df$Year <- year(time(price))
plot1 <- ggplot(GN.df, aes(NIKret, SPret, color=Year)) + geom_point() +
  geom_vline(xintercept = 0) + geom_hline(yintercept = 0) + theme_bw() + labs(x="", y="")
plot2 <- ggplot(GN.df, aes(NIKret, SPret, color=factor(Year))) + geom_point() +
  geom_vline(xintercept = 0) + geom_hline(yintercept = 0) + theme_bw() + labs(x="", y="")
grid.arrange(plot1, plot2, ncol=2)
[Figure: the same scatter plot colored by Year as a continuous scale (left) and by factor(Year) with one color per year, 1990-2018 (right)]
Boxplot
I A boxplot provides a graphical representation of the distribution of the data
I Box-whisker plot: the box represents the 25%, 50%, and 75% quantiles, and the whiskers extend to the most extreme values within 1.5 times the interquartile range; the dots are the observations beyond the whiskers (outliers)
I In the graph below a boxplot is drawn for each year separately
ggplot(GN.df, aes(factor(Year), SPret)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90))
[Figure: boxplots of SPret by factor(Year), 1990-2018]
Summary statistics
I The function summary() provides a few summary statistics of the distribution of a data object
I Of course, the variable needs to be numeric
summary(GSPC$ret.simple)
     Index                ret.simple
 Min.   :1990-01-02   Min.   :-9.0350
 1st Qu.:1996-12-26   1st Qu.:-0.4399
 Median :2004-01-07   Median : 0.0530
 Mean   :2004-01-07   Mean   : 0.0353
 3rd Qu.:2011-01-13   3rd Qu.: 0.5541
 Max.   :2018-01-24   Max.   :11.5800
                      NA's   :1
summary(GSPC$ret.log)
     Index                 ret.log
 Min.   :1990-01-02   Min.   :-9.4695
 1st Qu.:1996-12-26   1st Qu.:-0.4409
 Median :2004-01-07   Median : 0.0530
 Mean   :2004-01-07   Mean   : 0.0292
 3rd Qu.:2011-01-13   3rd Qu.: 0.5525
 Max.   :2018-01-24   Max.   :10.9572
                      NA's   :1
I The package fBasics provides the basicStats() function with a more comprehensive set of descriptive statistics compared to summary()
fBasics::basicStats(GSPC$ret.log)
                ret.log
nobs        7072.000000
NAs            1.000000
Minimum       -9.469512
Maximum       10.957197
1. Quartile   -0.440919
3. Quartile    0.552538
Mean           0.029210
Median         0.052970
Sum          206.545022
SE Mean        0.013172
LCL Mean       0.003390
UCL Mean       0.055031
Variance       1.226761
Stdev          1.107592
Skewness      -0.252791
Kurtosis       9.004323
Covariance and correlation between two or more assets
I When we have several variables or assets, the first question that arises in the analysis is whether they co-move
I Dependence is measured using the covariance and the correlation
I Do the S&P 500 and NIKKEI move together?
Ret <- subset(GN.df, select=c("SPret","NIKret"))
cov(Ret, use='complete.obs')
        SPret NIKret
SPret  16.960 13.446
NIKret 13.446 38.731
cor(Ret, use='complete.obs')
         SPret  NIKret
SPret  1.00000 0.52463
NIKret 0.52463 1.00000
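The two measures are tied together: the correlation is the covariance scaled by the product of the standard deviations. A quick check on simulated data (the series below are synthetic, not the returns above):

```r
set.seed(1)
x <- rnorm(100)              # synthetic "asset 1" returns
y <- 0.5 * x + rnorm(100)    # synthetic "asset 2", built to co-move with x
cov(x, y) / (sd(x) * sd(y))  # identical to cor(x, y)
cor(x, y)
```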
Plotting the data distribution
I A very useful tool to explore the distribution of the data is the histogram, which is an estimator of the underlying (population) distribution of the data. It is useful to assess (visually) characteristics of the data such as normality, fat tails, and asymmetry
hist(GSPC$ret.log, breaks=50, xlab="", main="")       # base function
qplot(ret.log, data=GSPC, geom="histogram", bins=50)  # ggplot function
[Figure: histogram of ret.log with base hist() (left) and ggplot2 qplot() (right)]
I We can overlay a non-parametric density estimate on the histogram, which appears as a smooth line that goes through the histogram bars
# base function
hist(GSPC$ret.log, breaks=50, main="", xlab="Return", ylab="", prob=TRUE)
lines(density(GSPC$ret.log, na.rm=TRUE), col=2, lwd=2)
box()
# ggplot function
ggplot(GSPC, aes(ret.log)) +
  geom_histogram(aes(y = ..density..), bins=50, color="black", fill="white") +
  geom_density(color="red", size=1.2) +
  theme_bw()
(Figure: histograms of ret.log with a kernel density line overlaid — base R version with Return on the x-axis, ggplot version with density on the y-axis)
I Or compare the histogram to a distribution (e.g., normal)
# base function
hist(GSPC$ret.log, breaks=50, main="", xlab="Return", ylab="", prob=TRUE)
curve(dnorm(x, mean(GSPC$ret.log, na.rm=T), sd(GSPC$ret.log, na.rm=T)),
      from=-10, to=10, add=TRUE, col="red", lwd=2)
box()
# ggplot function
ggplot(GSPC, aes(ret.log)) +
  geom_histogram(aes(y = ..density..), bins=50, color="black", fill="white") +
  stat_function(fun = dnorm, colour = "red",
                args = list(mean(GSPC$ret.log, na.rm=T), sd(GSPC$ret.log, na.rm=T)),
                size=1.2) +
  theme_bw()
(Figure: histograms of ret.log with the fitted normal density overlaid — base R version with Return on the x-axis, ggplot version with density on the y-axis)
Dates and times in R
I We already used the command as.Date() to define a string to be of type Date
I The default format of a date in R is 2011-07-17
I If in your dataset a date is specified in a different way, you need to help R read it by specifying the format= argument
I The syntax of the format is: %d for numerical day, %a and %A for abbreviated/unabbreviated weekday, %m for numerical month, %b and %B for abbreviated/unabbreviated month, %y and %Y for 2/4-digit year
as.Date("2011-07-17")  # default, no need to specify format
as.Date("July 17, 2011", format="%B %d,%Y")
as.Date("Monday July 17, 2011", format="%A %B %d,%Y")
as.Date("17072011", format="%d%m%Y")
as.Date("11@17#07", format="%y@%d#%m")
[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17"
I One operation we might want to do with dates is to calculate the difference between two dates
I This can be done by subtracting two dates or using the difftime() function, which also allows specifying the unit of time
date1 <- as.Date("July 17, 2011", format="%B %d,%Y")
date2 <- Sys.Date()
date2 - date1
difftime(date2, date1, units="secs")
difftime(date2, date1, units="days")
difftime(date2, date1, units="weeks")
Time difference of 2384 days
Time difference of 205977600 secs
Time difference of 2384 days
Time difference of 340.57 weeks
Time
I In addition to the date, we might need to specify the time of the day
I This is useful when dealing with intra-day data, such as the FX data that we discussed earlier and shown below
data.hf <- data.table::fread("USDJPY-2016-12.csv",
                             col.names=c("Pair","Date","Bid","Ask"),
                             colClasses=c("character","character","numeric","numeric"),
                             data.table=FALSE, showProgress=FALSE)
      Pair                    Date    Bid    Ask
1  USD/JPY 20161201 00:00:00.041 114.68 114.69
2  USD/JPY 20161201 00:00:00.042 114.68 114.69
3  USD/JPY 20161201 00:00:00.186 114.68 114.69
4  USD/JPY 20161201 00:00:00.188 114.68 114.69
5  USD/JPY 20161201 00:00:00.189 114.69 114.70
6  USD/JPY 20161201 00:00:00.223 114.69 114.70
7  USD/JPY 20161201 00:00:00.343 114.69 114.70
8  USD/JPY 20161201 00:00:00.347 114.69 114.70
9  USD/JPY 20161201 00:00:00.403 114.69 114.69
10 USD/JPY 20161201 00:00:00.415 114.69 114.69
I To work with time we can use two functions:
I strptime()
I as.POSIXlt()
I Both require specifying the format of the date and time part
I The format of the time is: %H hour (out of 24), %M minute, %S seconds, and %OS fractional seconds
strptime("20161201 01:00", format="%Y%m%d %H:%M")
strptime("20161201 00:00:01", format="%Y%m%d %H:%M:%S")
strptime("20161201 00:00:00.041", format="%Y%m%d %H:%M:%OS")
as.POSIXlt("20161201 00:00:00.041", format="%Y%m%d %H:%M:%OS")
##
date1 <- as.POSIXlt("20161201 00:00:00.041", format="%Y%m%d %H:%M:%OS")
date2 <- strptime("20161201 01:15:00.041", format="%Y%m%d %H:%M:%OS")
date2 - date1
difftime(date2, date1, unit="secs")
[1] "2016-12-01 01:00:00 EST"
[1] "2016-12-01 00:00:01 EST"
[1] "2016-12-01 00:00:00 EST"
[1] "2016-12-01 00:00:00 EST"
Time difference of 1.25 hours
Time difference of 4500 secs
lubridate package
I This package makes it easier to define dates by providing dedicated functions that combine as.Date() and the format specification
I These functions are:
I ymd: for dates in the format year, month, day
I dmy: dates with day, month, year format
I mdy: when the format is month, day, year
I ymd_hm: in addition to the date, the time is provided in hour and minute (the date part can be changed to other formats)
I ymd_hms: the time format is hour, minute, and seconds
library(lubridate)
ymd("20110717")
ymd("2011/07/17")
ymd_hm("20110717 01:00")
ydm_hms("20111707 00:00:00.041")
[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17 01:00:00 UTC"
[1] "2011-07-17 00:00:00 UTC"
I The package provides functions that make it easy to extract the year, month, day, day of the week/month/year, minute, second, etc.
mydate <- ydm_hms("20111707 00:00:00.041")
year(mydate)
month(mydate)
day(mydate)
wday(mydate, label=T, abbr=FALSE)
minute(mydate)
second(mydate)
[1] 2011
[1] 7
[1] 17
[1] Sunday
Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday
[1] 0
[1] 0.041
dplyr package
I The package dplyr provides functions to manipulate data frames
I Data analysis requires a lot of data manipulation and this package has made the task significantly easier
I dplyr has 5 main verbs:
I mutate: to create new variables
I select: to select columns of the data frame
I filter: to select rows based on a criterion
I group_by()/summarize: uses a function to summarize columns/variables in one value
I arrange: to order a data frame based on one or more variables
I dplyr is typically used with the %>% piping operator (read "then")
I %>% is useful to write compact code when we are not interested in using or storing the intermediate results
mutate and select
I mutate() to create new variables
I select() to select existing variables
library(dplyr)
library(lubridate)
GSPC.df <- mutate(GSPC.df, range = 100 * log(High/Low),
                  ret.c2c = 100 * log(Adjusted / lag(Adjusted)),
                  year = year(Date),
                  month = month(Date),
                  wday = wday(Date, label=T, abbr=F))
tail(GSPC.df, 2)

           Date   Open   High    Low  Close     Volume Adjusted   range   ret.c2c year month      wday
8334 2018-01-23 2835.1 2842.2 2830.6 2839.1 3519650000   2839.1 0.41073  0.217200 2018     1   Tuesday
8335 2018-01-24 2845.4 2853.0 2824.8 2837.5 4014070000   2837.5 0.99195 -0.056013 2018     1 Wednesday

tail(select(GSPC.df, Date, year, month, wday, range, ret.c2c), 2)

           Date year month      wday   range   ret.c2c
8334 2018-01-23 2018     1   Tuesday 0.41073  0.217200
8335 2018-01-24 2018     1 Wednesday 0.99195 -0.056013
filter()
I the filter() function is used to select specific rows of the data frame
filter(GSPC.df, wday == "Tuesday") %>% head(7)
        Date   Open   High    Low  Close    Volume Adjusted   range   ret.c2c year month    wday
1 1985-01-08 164.24 164.59 163.91 163.99  92110000   163.99 0.41400 -0.152332 1985     1 Tuesday
2 1985-01-15 170.51 171.82 170.40 170.81 155300000   170.81 0.82988  0.175790 1985     1 Tuesday
3 1985-01-22 175.23 176.63 175.14 175.48 174800000   175.48 0.84715  0.142568 1985     1 Tuesday
4 1985-01-29 177.40 179.19 176.58 179.18 115700000   179.18 1.46727  0.998381 1985     1 Tuesday
5 1985-02-05 180.35 181.53 180.07 180.61 143900000   180.61 0.80753  0.144058 1985     2 Tuesday
6 1985-02-12 180.51 180.75 179.45 180.56 111100000   180.56 0.72182  0.027697 1985     2 Tuesday
7 1985-02-19 181.60 181.61 180.95 181.33  90400000   181.33 0.36408 -0.148791 1985     2 Tuesday
group_by()/summarize()
I A common operation in data analysis is to group observations based on a certain characteristic and apply a certain function to each group
I Example: calculate the average/min/max return by day of the week
GSPC.df %>% group_by(wday) %>%
  summarize(AV.RET = mean(ret.c2c, na.rm=T),
            MIN.RET = min(ret.c2c, na.rm=T),
            MAX.RET = max(ret.c2c, na.rm=T))
# A tibble: 5 x 4
  wday        AV.RET  MIN.RET MAX.RET
  <ord>        <dbl>    <dbl>   <dbl>
1 Monday    0.010565 -22.8997 10.9572
2 Tuesday   0.066759  -5.9108 10.2457
3 Wednesday 0.054635  -9.4695  8.7089
4 Thursday  0.016346  -7.9224  6.6923
5 Friday    0.019671  -7.0082  6.1328
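arrange(), the fifth verb listed earlier, has not been shown yet; a minimal sketch with a hypothetical toy data frame:

```r
library(dplyr)

# Hypothetical toy data frame of yearly returns
df <- data.frame(year = c(2017, 2015, 2016), ret = c(0.2, -0.1, 0.4))

arrange(df, year)        # sort rows ascending by year
arrange(df, desc(ret))   # sort rows descending by return
```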
I The grouping can also be done on two variables, for example month and year
GSPC.df %>% group_by(month, year) %>%
  summarize(AV.RET = mean(ret.c2c, na.rm=T),
            MIN.RET = min(ret.c2c, na.rm=T),
            MAX.RET = max(ret.c2c, na.rm=T))
# A tibble: 397 x 5
# Groups: month [?]
   month  year    AV.RET  MIN.RET MAX.RET
   <dbl> <dbl>     <dbl>    <dbl>   <dbl>
 1     1  1985  0.393875 -0.54228  2.2566
 2     1  1986  0.010744 -2.76472  1.4843
 3     1  1987  0.589429 -1.40073  2.3024
 4     1  1988  0.198181 -7.00824  3.5231
 5     1  1989  0.327143 -0.87157  1.4854
 6     1  1990 -0.324090 -2.61989  1.8710
 7     1  1991  0.184905 -1.74726  3.6642
 8     1  1992 -0.091477 -1.11960  1.4615
 9     1  1993  0.035106 -0.87605  0.8903
10     1  1994  0.152304 -0.58097  1.1363
# ... with 387 more rows
Does volatility of the S&P 500 vary over time?
I dplyr functions can be very useful to write relatively lengthy operations in a compact and readable manner
I Assume we want to calculate the average volatility by year using the intra-day range as a proxy for volatility
GSPC.df %>% mutate(range = 100 * log(High/Low),
                   year = year(Date)) %>%
  group_by(year) %>%
  summarize(av.range = mean(range, na.rm=T)) %>%
  ggplot(., aes(year, av.range)) + geom_line(color="steelblue4", size=1.3) +
  theme_bw() + labs(x="", y="")
(Figure: line plot of the average intra-day range of the S&P 500 by year)
Creating functions in R
I The advantage of a programming language is that you have the flexibility to write your own functions. This can be useful when:
1. No package provides pre-programmed functions to perform the analysis you want to conduct
2. The task is very complex and you prefer to break it down into smaller tasks that make the code easier to read, interpret, and test
3. Once you write a function, you can use it again in future analyses
I A function is a set of operations applied to some data
I The syntax is as follows:
myfunction <- function(inputs){
  ## operations
  return(output)
}
I The function() can include several arguments
I The return() is the output of the function and can include only one object, although that object might include several elements
I To call the function you type in R the name of the function with the appropriate arguments: myfunction()
A function to calculate the sample average
I R provides the mean() function to calculate the sample average
mean(GSPC$ret.log, na.rm=T)
[1] 0.02921
I As an illustration, let’s write a function that calculates the sample average: R̄ = (1/T) ∑_{t=1}^{T} R_t
I In this case the set of operations to perform is quite simple:
1. sum the values of the time series
2. divide by the total number of observations
I Below is the code that defines a new function called mymean()
# Y is the input, Ybar the output of the function
mymean <- function(Y){
  Y = na.omit(Y)
  Ybar <- sum(Y) / length(Y)
  return(Ybar)
}
I Let’s compare the results:
mean(GSPC$ret.log, na.rm=T)
[1] 0.02921
mymean(GSPC$ret.log)
[1] 0.02921
Loops in R
I Loops are a useful tool when you want to perform the same set of operations on several time series or datasets
I A common loop is the for loop which has the following syntax:
for (i in 1:N){
  # write your commands here
}
I The loop iterates through the values of i from 1 to N
I Example:
for (i in 1:3){
  print(i)
}
[1] 1
[1] 2
[1] 3
mysum() function using a for loop
mysum <- function(Y){
  Y = na.omit(Y)
  N = length(Y)   # define N as the number of elements of Y
  sumY = 0        # initialize the variable that will store the sum of Y
  for (i in 1:N){
    sumY = sumY + as.numeric(Y[i])  # current sum is equal to previous sum
  }                                 # plus the i-th value of Y
  return(sumY)    # as.numeric(): makes sure to transform
}                 # from other classes to a number
mysum(GSPC$ret.log)
[1] 206.55
sum(GSPC$ret.log, na.rm=T)
[1] 206.55
A simulation exercise
I Simulations are used to evaluate some quantities (e.g., the price of an option or an estimator) based on a large number of samples generated from a certain distribution
I The recipe works as follows:
1. generate random values from a model
2. calculate the quantity of interest
3. repeat 1 and 2 many times
I Example: we want to evaluate if the distribution of the sample mean is N(µ, σ²/N), where µ is the population mean, σ² the population variance, and N the sample size
I In the example on the following slide we generate data from a normal distribution with mean 0 and standard deviation 2
S = 5000    # set the number of simulations
N = 1000    # set the length of the sample
mu = 0      # population mean
sigma = 2   # population standard deviation

Ybar = vector('numeric', S)   # create an empty vector of S elements
                              # to store the sample mean of each simulation
for (i in 1:S){
  Y = rnorm(N, mu, sigma)     # generate a sample of length N
  Ybar[i] = mean(Y)           # store the sample mean
}

ggplot(data = data.frame(Ybar=Ybar), aes(Ybar)) +
  geom_histogram(aes(y = ..density..), color="red", fill="lightsalmon", bins = 40) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = sigma/sqrt(N)),
                color="seagreen", size=1.2) +
  theme_bw() + xlim(c(-0.3, 0.3))
(Figure: histogram of the simulated sample means Ybar with the N(0, σ²/N) density overlaid)