Getting started with R
Sebastiano Manzan
ECO 4051 | Spring 2018
Why R?
I R is becoming increasingly popular in economics and finance
I It is open source, simple to use, and has numerous packages (12,126 as of today) contributed by a large community of users, covering every aspect of statistical modeling and data analysis
I If you can have an excellent free product, why should you pay for an excellent expensive product (e.g., Matlab, SAS)?
I Learning a programming language is a useful skill in the current labor market, where companies are increasingly interested in extracting information (intelligence) from the large datasets they collect about their business
I Trends in the industry (just two examples):
I Microsoft bought Revolution R and renamed it Microsoft R (a version of R that is optimized to work on multiple cores)
I IBM bought SPSS and many data/analytics providers (e.g., the Weather Channel), and sponsors Cognitive Class.ai, which offers free online courses about data science and machine learning
Outline of the course
1. Getting started with R
2. Linear Regression Model
3. Time series models
4. Volatility modeling
5. High-frequency data
6. Measuring Financial Risk
Let's get started with RStudio
[Figure 2: screenshot of RStudio]
How R works
I R creates and works with objects that contain data
I The object/data can have different structures, such as:
I data frame: a table where each column represents a variable and each row a different observation (a different time period or unit); a variable can be numerical or a string (similar to an Excel spreadsheet)
I matrix: same as a data frame, but all variables/columns have to be of the same type (typically all numbers)
I list: an object of objects; each element of the list can be, e.g., a data frame, a matrix, or a vector (similar to a set of Excel spreadsheets)
I Function: in R we can create functions that take a set of arguments and perform a set of operations on a data object; e.g., mean(x, na.rm=T)
I Package: a group of functions with a specific purpose (e.g., ggplot2)
I install a package: install.packages("ggplot2") (only done once)
I use the package: library(ggplot2) or require(ggplot2)
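A quick sketch of these structures (all values below are made up for illustration):

```r
# data frame: each column is a variable, each row an observation
df <- data.frame(ticker = c("MMM", "ABT"), price = c(165.4, 44.2))

# matrix: all entries must be of the same type (here, numeric)
m <- matrix(1:6, nrow = 2, ncol = 3)

# list: an object of objects; elements can be of any type
mylist <- list(data = df, mat = m, vec = c(1, 2, 3))

# function: takes arguments and performs operations on a data object
demean <- function(x, na.rm = TRUE) x - mean(x, na.rm = na.rm)
demean(c(1, 2, 3))  # -1 0 1
```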
Loading data in R
I It is convenient to start an R session by setting the working directory where the data/files are stored; for example:
I setwd('/Users/username/Baruch/ECO4051/') on Mac/Unix
I setwd('c:/Baruch/ECO4051/') on Windows
I There are two ways to load a dataset in R:
1. import the data from a local file
2. import the data from an online source (e.g., Yahoo Finance, FRED, Google Finance, Quandl)
Base function read.csv()
I You can load a file from RStudio via Tools -> Import Dataset, and then you are given the option From Text File or From Web URL
I Otherwise, you can type a few lines of code (table from Wikipedia):
splist <- read.csv("List_SP500.csv")
head(splist, 10)
   Ticker.symbol                   Security  Address.of.Headquarters Date.first.added
1            MMM                 3M Company      St. Paul, Minnesota
2            ABT        Abbott Laboratories  North Chicago, Illinois       1964-03-31
3           ABBV                AbbVie Inc.  North Chicago, Illinois       2012-12-31
4            ACN              Accenture plc          Dublin, Ireland       2011-07-06
5           ATVI        Activision Blizzard Santa Monica, California       2015-08-31
6            AYI          Acuity Brands Inc         Atlanta, Georgia       2016-05-03
7           ADBE          Adobe Systems Inc     San Jose, California       1997-05-05
8            AMD Advanced Micro Devices Inc    Sunnyvale, California       2017-03-20
9            AAP         Advance Auto Parts        Roanoke, Virginia       2015-07-09
10           AES                   AES Corp      Arlington, Virginia
I The commands head(x, n) and tail(x, n) show the first and last n observations
Data types
I The str() command can be used to evaluate the object structure and the data types:
str(splist)
'data.frame': 505 obs. of 4 variables:
 $ Ticker.symbol          : Factor w/ 505 levels "A","AAL","AAP",..: 314 7 5 8 52 58 9 33 3 17 ...
 $ Security               : Factor w/ 505 levels "3M Company","A.O. Smith Corp",..: 1 3 4 5 6 7 8 10 9 11 ...
 $ Address.of.Headquarters: Factor w/ 256 levels "Akron, Ohio",..: 222 159 159 66 210 8 204 226 195 5 ...
 $ Date.first.added       : Factor w/ 303 levels "","1964-03-31",..: 1 2 213 195 252 273 79 292 249 1 ...
I Each variable in the data frame splist has a type that can be:
I numeric: (or double) is used for decimal values
I integer: for integer values
I character: for strings of characters
I Date: for dates
I factor: represents a variable (either numeric, integer, or character) that categorizes the values into a small (relative to the sample size) set of categories (or levels)
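These types can be inspected directly with class(); a small made-up example:

```r
x <- c(1.5, 2.7)                          # numeric (double)
n <- 3L                                   # integer (note the L suffix)
s <- "AAPL"                               # character
d <- as.Date("1964-03-31")                # Date
f <- factor(c("NYSE", "NASDAQ", "NYSE"))  # factor with two levels

class(d)    # "Date"
levels(f)   # the set of categories: "NASDAQ" "NYSE"
```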
I The read.csv() function has the annoying feature that any string is interpreted as a factor
I This can be switched off by adding the argument stringsAsFactors = FALSE
splist <- read.csv("List_SP500.csv", stringsAsFactors = FALSE)
str(splist)
'data.frame': 505 obs. of 4 variables:
 $ Ticker.symbol          : chr "MMM" "ABT" "ABBV" "ACN" ...
 $ Security               : chr "3M Company" "Abbott Laboratories" "AbbVie Inc." "Accenture plc" ...
 $ Address.of.Headquarters: chr "St. Paul, Minnesota" "North Chicago, Illinois" "North Chicago, Illinois" "Dublin, Ireland" ...
 $ Date.first.added       : chr "" "1964-03-31" "2012-12-31" "2011-07-06" ...
I The ticker symbol, security name, and address are all correctly interpreted as chr
I The Date.first.added is also imported as a string, but we would like to define it as type Date
I The code below defines the column/variable Date.first.added as a date with the command as.Date()
splist$Date.first.added <- as.Date(splist$Date.first.added, format="%Y-%m-%d")
str(splist)
'data.frame': 505 obs. of 4 variables:
 $ Ticker.symbol          : chr "MMM" "ABT" "ABBV" "ACN" ...
 $ Security               : chr "3M Company" "Abbott Laboratories" "AbbVie Inc." "Accenture plc" ...
 $ Address.of.Headquarters: chr "St. Paul, Minnesota" "North Chicago, Illinois" "North Chicago, Illinois" "Dublin, Ireland" ...
 $ Date.first.added       : Date, format: NA "1964-03-31" "2012-12-31" "2011-07-06" ...
I The $ sign is used to extract a variable/column from a data frame; splist$Date.first.added extracts Date.first.added from the data frame object splist
I The as.Date() function converts splist$Date.first.added from character to Date; the role of the argument format="%Y-%m-%d" is to specify the format of the date being read
read_csv() from readr package
I In addition to the base read.csv() function, there are other packages that provide functions to read data
I There are two problems with the read.csv() function:
I type guessing (in particular dates)
I reading speed
I The function read_csv() from package readr tries to solve both problems (we will talk about speed later):
library(readr)
splist <- read_csv("List_SP500.csv")
str(splist, max.level=1)
Classes 'tbl_df', 'tbl' and 'data.frame': 505 obs. of 4 variables:
 $ Ticker symbol          : chr "MMM" "ABT" "ABBV" "ACN" ...
 $ Security               : chr "3M Company" "Abbott Laboratories" "AbbVie Inc." "Accenture plc" ...
 $ Address of Headquarters: chr "St. Paul, Minnesota" "North Chicago, Illinois" "North Chicago, Illinois" "Dublin, Ireland" ...
 $ Date first added       : Date, format: NA "1964-03-31" "2012-12-31" "2011-07-06" ...
 - attr(*, "spec")=List of 2
  ..- attr(*, "class")= chr "col_spec"
I Notice:
I the classes tbl (tibble) and tbl_df are a type of data frame specific to this package
I Date first added is defined as a Date, which saves us a line of code
I The file GSPC.csv contains daily data for the S&P 500 Index from January 1985, downloaded from Yahoo Finance
I Below is a comparison of read.csv() and read_csv()
index <- read.csv("GSPC.csv", stringsAsFactors = FALSE)
'data.frame': 8237 obs. of 7 variables:
 $ Date         : chr "1985-01-02" "1985-01-03" "1985-01-04" "1985-01-07" ...
 $ GSPC.Open    : num 167 165 165 164 164 ...
 $ GSPC.High    : num 167 166 165 165 165 ...
 $ GSPC.Low     : num 165 164 163 164 164 ...
 $ GSPC.Close   : num 165 165 164 164 164 ...
 $ GSPC.Volume  : num 67820000 88880000 77480000 86190000 92110000 ...
 $ GSPC.Adjusted: num 165 165 164 164 164 ...
index <- read_csv("GSPC.csv")
Classes 'tbl_df', 'tbl' and 'data.frame': 8237 obs. of 7 variables:
 $ Date         : Date, format: "1985-01-02" "1985-01-03" "1985-01-04" "1985-01-07" ...
 $ GSPC.Open    : num 167 165 165 164 164 ...
 $ GSPC.High    : num 167 166 165 165 165 ...
 $ GSPC.Low     : num 165 164 163 164 164 ...
 $ GSPC.Close   : num 165 165 164 164 164 ...
 $ GSPC.Volume  : num 67820000 88880000 77480000 86190000 92110000 ...
 $ GSPC.Adjusted: num 165 165 164 164 164 ...
 - attr(*, "spec")=List of 2
  ..- attr(*, "class")= chr "col_spec"
Saving data files
I In addition to reading/importing files, we might also need to save data files
I This can be done with the base function write.csv() (see help(write.csv) for the arguments)
index <- read_csv("GSPC.csv")
write.csv(index, file = "myfile.csv", row.names = FALSE)
I The index object is saved to a file called myfile.csv in the working directory
Plotting the data . . .
I Visualization is an essential part of data analysis
I It is useful to guide the analysis and to communicate our results
I It is typically easier to process and understand a graph than a table of numbers
I The base function in R to plot data is plot(), which takes as arguments:
I x, y: the variables to plot on the x-axis and y-axis
I type: p for points, l for lines, b for both (and more)
I xlim, ylim: the range of the axes
I xlab, ylab: the labels of the axes
I main: string to use for the title
I col: color of the points and/or line
I pch: type of point to use
I The code below produces a time series plot of the S&P 500 Index:
I The column index$Date is defined as class Date and used as the x-axis
I The column index$GSPC.Adjusted is used as the y-axis
I The left plot uses the default settings; the plot on the right has been customized
# LEFT PLOT
plot(index$Date, index$GSPC.Adjusted)
# RIGHT PLOT
plot(index$Date, index$GSPC.Adjusted, type="l", xlab="", ylab="S&P 500 Index",
     xaxt="n", yaxt="n")
ticks <- seq(index$Date[1], index$Date[nrow(index)], by="year")
axis(1, at=ticks, labels=ticks, cex.axis=0.9, col="orange", col.axis="blue")
axis(2, at=seq(0, 2000, 500), labels=seq(0, 2000, 500), col.ticks=3, cex.axis=0.75, col.axis="purple")
axis(4, at=seq(0, 2000, 500), labels=seq(0, 2000, 500), col.ticks=3, cex.axis=0.75, col.axis="purple")
[Figure: two time series plots of index$GSPC.Adjusted against index$Date, 1985-2017 — default settings (left) and the customized version labeled "S&P 500 Index" with yearly ticks (right)]
Time series objects
I A variable that is observed over time is called a time series (e.g., stock prices, real GDP, inflation)
I There are several packages that provide infrastructure to define an object as a time series object
I I will mostly use the xts package (which underlies the quantmod package for quantitative finance; other time series packages are ts and zoo)
I To define an object as a time series we use the command xts(), which takes two arguments:
I a data frame
I a vector of dates (of class Date)
library(xts)
index.xts <- xts(subset(index, select=-Date), order.by=index$Date)
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
1985-01-02    167.20    167.20   165.19     165.37    67820000        165.37
1985-01-03    165.37    166.11   164.38     164.57    88880000        164.57
1985-01-04    164.55    164.55   163.36     163.68    77480000        163.68
1985-01-07    163.68    164.71   163.68     164.24    86190000        164.24
I The xts package provides functions to extract information from a time series object:
start(index.xts)       # start date
end(index.xts)         # end date
periodicity(index.xts) # periodicity/frequency (daily, weekly, monthly)
[1] "1985-01-02"
[1] "2017-09-01"
Daily periodicity from 1985-01-02 to 2017-09-01
I There are also functions to aggregate the observations from a high frequency (e.g., daily) to a lower frequency (e.g., weekly/monthly/quarterly)
I By default, to.weekly() aggregates each interval into an OHLC bar: the open of the first day of the interval, the high/low over the interval, and the close of the last day
index.weekly <- to.weekly(index.xts)
           index.xts.Open index.xts.High index.xts.Low index.xts.Close index.xts.Volume index.xts.Adjusted
1985-01-04         167.20         167.20        163.36          163.68        234180000             163.68
1985-01-11         163.68         168.72        163.68          167.91        509830000             167.91
1985-01-18         167.91         171.94        167.58          171.32        634000000             171.32
1985-01-25         171.32         178.16        171.31          177.35        749100000             177.35
I The functions apply.weekly() and apply.monthly() are used when the goal is to apply a function to each week/month in the sample
I In the examples below I apply these functions to subsample the first() and last() day of the week (notice that first() keeps the full row of the first day, unlike to.weekly(), which builds OHLC bars) and to calculate the mean() of the week
index.weekly <- apply.weekly(index.xts, "first")
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
1985-01-04    167.20    167.20   165.19     165.37    67820000        165.37
1985-01-11    163.68    164.71   163.68     164.24    86190000        164.24
index.weekly <- apply.weekly(index.xts, "last")
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
1985-01-04    164.55    164.55   163.36     163.68    77480000        163.68
1985-01-11    168.31    168.72   167.58     167.91   107600000        167.91
index.weekly <- apply.weekly(index.xts, "mean")
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
1985-01-04    165.71    165.95   164.31     164.54    78060000        164.54
1985-01-11    165.08    166.38   164.83     165.93   101966000        165.93
Plotting time series data
I Once the object is defined as xts, plotting is executed by plot.xts(), which has the advantage that:
I the x-axis is by default defined to be time
I dates are rendered more easily through the grid
I Using plot() on an xts object automatically calls plot.xts()
library(quantmod)  # for getSymbols() and Ad()
GSPC <- getSymbols("^GSPC", from="1985-01-01", auto.assign=FALSE)
plot(Ad(GSPC))
[Figure: plot of Ad(GSPC) — S&P 500 adjusted close, 1985-2017]
Subsetting
I The xts package provides its own syntax to subset the time series object
I Below are some examples:
par(mfrow=c(2,3)) # organize the plots in 2 rows and 3 columns
plot(index.xts['2007'])
plot(index.xts['2007/2009'])
plot(index.xts['/2007'])
plot(index.xts['2007/'])
plot(index.xts['2007-03-21/2008-02-12'])
plot(index.xts[.indexwday(index.xts)==3])
[Figure: six panels showing index.xts["2007"], index.xts["2007/2009"], index.xts["/2007"], index.xts["2007/"], index.xts["2007-03-21/2008-02-12"], and index.xts[.indexwday(index.xts) == 3]]
getSymbols() from the quantmod package
I There are several packages that provide functions to download economic and financial data by only specifying the ticker and time period (and, in some functions, the frequency)
I I will discuss only the getSymbols() function from package quantmod, which will be used in this class
I Features of getSymbols():
I Sources: Yahoo Finance, Google Finance, OANDA (fx rates), FRED (argument src)
I Download multiple tickers in one call
I Select the time period (with arguments from and to)
I By default the output is an xts object for each ticker specified
I Yahoo Finance: downloads open, high, low, close, volume, and adjusted close at the daily frequency
I You can convert to weekly or monthly using to.weekly()/to.monthly() or apply.weekly()/apply.monthly()
getSymbols() with one ticker
I By default, the function creates an xts object with the name of the ticker (minus the ^ part if you are downloading an index)
I When downloading only one ticker, setting auto.assign=FALSE allows you to assign the output to an object that you name (in the example below, data)
library(quantmod)
getSymbols("^GSPC", src = "yahoo", from = "1990-01-01")
[1] "GSPC"
tail(GSPC, 2)
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
2018-01-23    2835.1    2842.2   2830.6     2839.1  3519650000        2839.1
2018-01-24    2845.4    2853.0   2824.8     2837.5  4014070000        2837.5
data <- getSymbols("^GSPC", src = "yahoo", from = "1990-01-01", auto.assign = FALSE)
tail(data, 2)
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
2018-01-23    2835.1    2842.2   2830.6     2839.1  3519650000        2839.1
2018-01-24    2845.4    2853.0   2824.8     2837.5  4014070000        2837.5
Multiple tickers
I A vector of symbols can be passed to getSymbols() to download data for multiple assets
I The function will create an xts object for each ticker, named after the ticker (the auto.assign option does not work for more than one ticker)
I When you download more than 5 symbols you will see the message pausing 1 second between requests for more than 5 symbols
library(quantmod)
getSymbols(c("^GSPC","^DJI"), src="yahoo", from="1990-01-01")
periodicity(GSPC)
periodicity(DJI)
[1] "GSPC" "DJI"
Daily periodicity from 1990-01-02 to 2018-01-24
Daily periodicity from 1990-01-02 to 2018-01-24
I getSymbols() also allows you to specify an environment (argument env)
I Think of the environment as a folder in the R global environment where the objects will be stored
I Steps:
I create a new environment with the new.env() command (called myenv below)
I call the getSymbols() function and set the env= argument to the new environment you created
I the ls() command below lists the objects in the new environment myenv
splist <- read.csv("List_SP500.csv", stringsAsFactors = FALSE)
myenv <- new.env()
getSymbols(splist$Ticker.symbol[1:10], env=myenv, from="2010-01-01", src="yahoo")
[1] "MMM" "ABT" "ABBV" "ACN" "ATVI" "AYI" "ADBE" "AMD" "AAP" "AES"
ls(myenv)
[1] "AAP" "ABBV" "ABT" "ACN" "ADBE" "AES" "AMD" "ATVI" "AYI" "MMM"
quantmod functionalities
I The object created contains all the information, but for our analysis we might need only some of the columns
I The package provides functions to extract the open price Op(), the closing price Cl(), the highest intra-day price Hi(), the lowest Lo(), the volume Vo(), and the adjusted closing price Ad()
I OpCl() calculates the open-to-close daily return, ClCl() the close-to-close return, and LoHi() the low-to-high difference (also called the intra-day range)
data.new <- merge(Ad(GSPC),Ad(DJI))
           GSPC.Adjusted DJI.Adjusted
1990-01-02        359.69       2810.1
1990-01-03        358.76       2809.7
1990-01-04        355.67       2796.1
Oanda
I Daily exchange rates for a wide range of currency pairs
I Limit of 2000 days per request
I The command oanda.currencies gives you the symbols for 191 currencies
getSymbols(c("USD/EUR", "USD/JPY"), src="oanda")
[1] "USDEUR" "USDJPY"
par(mfrow=c(1,2))
plot(USDEUR)
plot(USDJPY)
[Figure: two panels, USDEUR and USDJPY daily exchange rates, Jul 2017-Jan 2018]
FRED
I Federal Reserve Economic Data (FRED) can be used to download macroeconomic time series for the US economy, as well as international series
I Visit FRED to find the symbol of the variable(s) you are interested in downloading
I Below are some examples:
library(quantmod)
macrodata <- getSymbols(c('UNRATE','CPIAUCSL','GDPC1'), src="FRED")
macrodata <- merge(UNRATE, CPIAUCSL, GDPC1)
par(mfrow=c(1,3))
plot(UNRATE); plot(CPIAUCSL); plot(GDPC1)
[Figure: three panels — UNRATE (from 1948), CPIAUCSL (from 1947), and GDPC1 (from 1947)]
I Exchange rates are also available in FRED (no restriction on the time period)
# DEXUSEU: U.S. Dollars to One Euro
# DEXJPUS: Japanese Yen to One U.S. Dollar
macrodata <- getSymbols(c('DEXUSEU','DEXJPUS'), src="FRED", from="1975-01-01")
par(mfrow=c(1,2))
plot(DEXUSEU)
plot(DEXJPUS)
[Figure: two panels — DEXUSEU (from 1999) and DEXJPUS (from 1971)]
Quandl
I Quandl works as an aggregator of open and subscription databases
I The macroeconomic variables retrieved from FRED can also be obtained from Quandl:
library(Quandl)
macrodata <- Quandl(c("FRED/UNRATE", "FRED/CPIAUCSL", "FRED/GDPC1"),
                    start_date="1950-01-02", type="xts")
head(macrodata)
           FRED.UNRATE - Value FRED.CPIAUCSL - Value FRED.GDPC1 - Value
1950-02-01                 6.4                 23.61                 NA
1950-03-01                 6.3                 23.64                 NA
1950-04-01                 5.8                 23.65             2147.6
1950-05-01                 5.5                 23.77                 NA
1950-06-01                 5.4                 23.88                 NA
1950-07-01                 5.0                 24.07             2230.4
I Quandl has many more datasets, e.g. commodity spot and futures prices
oil.spot <- Quandl("COM/WLD_CRUDE_WTI", type="xts")
coffee.spot <- Quandl("COM/COFFEE_BRZL", type="xts")
sp.futures <- Quandl("CHRIS/CME_SP1", type="xts")
gold.futures <- Quandl("CHRIS/CME_GC3", type="xts")
par(mfrow=c(1,4))
plot(oil.spot); plot(coffee.spot); plot(sp.futures$Settle); plot(gold.futures$Settle)
[Figure: four panels — oil.spot, coffee.spot, sp.futures$Settle, and gold.futures$Settle]
Reading large files
I Files imported in R are stored in the memory of the program
I The physical limit to the file size that can be imported is determined by the RAM of your machine (2, 4, 6 GB)
I Reading large files can be time consuming when using the base functions
I Two packages are available to help with this task:
I readr (function read_csv())
I data.table (function fread())
I I will perform a speed comparison of the three functions using a common benchmark
I The benchmark for the comparison is obtained from the Center for Research in Security Prices (CRSP) at the University of Chicago. The variables in the dataset are:
I PERMNO: identification number for each company
I date: date in format 2015/12/31
I EXCHCD: exchange code
I TICKER: company ticker
I COMNAM: company name
I CUSIP: another identification number for the security
I DLRET: delisting return
I PRC: price
I RET: return
I SHROUT: shares outstanding
I ALTPRC: alternative price
I The observations are all companies listed on the NYSE, NASDAQ, and AMEX from January 1985 until December 2016 at the monthly frequency, for a total of 3,627,236 observations and 16 variables. The size of the file is 328 MB
read_csv() from readr package
I The command Sys.time() reads the current time, which is assigned to start.time
I The time to perform the operation is calculated as the difference between Sys.time() and start.time
start.time <- Sys.time()
crsp <- read.csv("crsp_eco4051_jan2017.csv", stringsAsFactors = FALSE)
end.csv <- Sys.time() - start.time
Time difference of 39.204 secs
I and for the read_csv() function?
library(readr)
start.time <- Sys.time()
crsp <- read_csv("crsp_eco4051_jan2017.csv")
end_csv <- Sys.time() - start.time
Time difference of 5.5403 secs
I read_csv() is 7.1 times faster than read.csv()
fread() from data.table package
I Another function to read data fast is fread()
I The arguments:
I data.table: whether the output will be a regular data frame or a data.table frame (TRUE/FALSE)
I showProgress: whether partial info about the percentage loaded should be printed (TRUE/FALSE)
start.time <- Sys.time()
crsp <- data.table::fread("crsp_eco4051_jan2017.csv",
                          data.table=FALSE, showProgress = FALSE)
end.fread <- Sys.time() - start.time
Time difference of 3.1622 secs
I fread() is 12 times faster than read.csv() and 1.8 times faster than read_csv()
Create returns
I Very often we need to create new variables that are transformations of existing variables
I One example is to calculate the return of an asset, that is:
1. Simple return: Rt = (Pt − Pt−1)/Pt−1
2. Logarithmic return: rt = log(Pt) − log(Pt−1)
I In macro this transformation is typically called the growth rate
I The transformation is easily done in R using the lag() and diff() functions
GSPC <- getSymbols("^GSPC", from="1990-01-01", auto.assign = FALSE)
GSPC$ret.simple <- 100 * (Ad(GSPC) - lag(Ad(GSPC), 1)) / lag(Ad(GSPC), 1)
GSPC$ret.log <- 100 * (log(Ad(GSPC)) - lag(log(Ad(GSPC)), 1))
# equivalently, using diff():
GSPC$ret.simple <- 100 * diff(Ad(GSPC)) / lag(Ad(GSPC), 1)
GSPC$ret.log <- 100 * diff(log(Ad(GSPC)))
head(GSPC)
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted ret.simple  ret.log
1990-01-02    353.40    359.69   351.98     359.69   162070000        359.69         NA       NA
1990-01-03    359.69    360.59   357.89     358.76   192330000        358.76   -0.25855 -0.25889
1990-01-04    358.76    358.76   352.89     355.67   177000000        355.67   -0.86130 -0.86503
1990-01-05    355.67    355.67   351.35     352.20   158530000        352.20   -0.97562 -0.98041
1990-01-08    352.20    354.24   350.54     353.79   140110000        353.79    0.45145  0.45043
1990-01-09    353.83    354.17   349.61     349.62   155210000        349.62   -1.17867 -1.18567
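The two definitions can be checked on a made-up price vector (no download needed); for small price changes the simple and log returns are very close:

```r
p <- c(100, 101, 99, 102)                    # hypothetical prices
ret.simple <- 100 * diff(p) / p[-length(p)]  # (Pt - Pt-1)/Pt-1, in percent
ret.log    <- 100 * diff(log(p))             # log(Pt) - log(Pt-1), in percent
round(ret.simple, 3)  # 1.000 -1.980  3.030
round(ret.log, 3)     # 0.995 -2.000  2.985
```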
I If we want to transform all the columns of a data frame or xts object, we can simply perform the operation on the object rather than on each variable
# "^GSPC" = S&P 500 Index, "^N225" = Nikkei 225, "^STOXX50E" = EURO STOXX 50
data <- getSymbols(c("^GSPC", "^N225", "^STOXX50E"), from="2000-01-01")
price <- merge(Ad(GSPC), Ad(N225), Ad(STOXX50E))
ret <- 100 * diff(log(price))
tail(ret, 5)
           GSPC.Adjusted N225.Adjusted STOXX50E.Adjusted
2018-01-19      0.437565      0.187892                NA
2018-01-22      0.803436      0.034728           0.44324
2018-01-23      0.217200      1.284195           0.19107
2018-01-24     -0.056013     -0.763018          -0.79476
2018-01-25            NA     -1.139636                NA
Elegant graphics: ggplot2 package
I The base plotting functions are easy to use and convenient for quick plotting
I However, they lack elegance and it is difficult to produce high-quality graphics
I The package ggplot2 offers an alternative set of functions to make graphs
I There are two ways of producing plots with ggplot2:
1. qplot() is a wrapper function similar to plot() that uses the underlying ggplot2 plotting functions
2. Using the grammar of graphics, which is composed of:
I ggplot(): creates the graph and specifies the data frame that contains the variables to plot
I geom_xxx(): the type of plot that is needed; point, line, histogram, boxplot, etc.
I aes(): the x and y variables to plot
I theme(): the overall look of the plot; theme_bw(), theme_classic(), theme_dark(), etc.
I ggplot2 does not recognize the time series properties of xts objects, so we have to specify the x-axis
I The time(GSPC) command is used to extract the date associated with each observation
GSPC <- getSymbols("^GSPC", from="1985-01-01", auto.assign=FALSE)
names(GSPC) <- sub('GSPC.', "", names(GSPC))
library(ggplot2)
qplot(time(GSPC), GSPC$Close, geom="line")
[Figure: qplot line chart of GSPC$Close against time(GSPC)]
I ggplot2 interacts best with data frames
I We can extract a data frame from the xts object by:
I creating a new variable that represents the Date
I using coredata() to extract the data from the xts object
GSPC.df <- data.frame(Date = time(GSPC), coredata(GSPC))
qplot(Date, Close, data = GSPC.df, geom = "line")
[Figure: qplot line chart of Close against Date]
I We can produce the same graph using the ggplot2 grammar, as in the example below
I Notice that we can assign a ggplot to an object (called myplot) that can be used later and altered by changing only specific aspects
myplot <- ggplot(GSPC.df) + geom_line(aes(x = Date, y = Close))
myplot
[Figure: ggplot line chart of Close against Date]
I The plots can be customized with themes, line colors and types, label names, etc.
I par(mfrow=c(2,2)) does not work with ggplot2; we should instead use the grid.arrange() function from package gridExtra
plot1 <- ggplot(GSPC.df, aes(Date, Adjusted)) + geom_line(color="darkgreen")
plot2 <- plot1 + theme_bw()
plot3 <- plot2 + theme_classic() + labs(x="", y="Index", title="S&P 500")
plot4 <- plot3 + geom_line(color="darkorange") + geom_smooth(method="lm") +
  theme_dark() + labs(subtitle="Period: 1985/2016", caption="Source: Yahoo")
library(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, ncol=2)
[Figure: four versions of the S&P 500 plot — default theme, theme_bw(), theme_classic() with title "S&P 500", and theme_dark() with subtitle "Period: 1985/2016" and caption "Source: Yahoo"]
I A scatter plot of two variables can be easily produced in ggplot2
library(lubridate)  # year() is in the lubridate package
data <- getSymbols(c("^GSPC", "^N225"), from="1990-01-01")
price <- merge(Ad(to.monthly(GSPC)), Ad(to.monthly(N225)))
ret <- 100 * diff(log(price))
GN.df <- data.frame(Date=time(price), Year = year(time(price)), coredata(merge(price, ret)))
names(GN.df) <- c("Date", "Year", "SP", "NIK", "SPret", "NIKret")
plot1 <- ggplot(GN.df, aes(NIKret, SPret)) + geom_point() +
  geom_vline(xintercept = 0) + geom_hline(yintercept = 0)
plot2 <- plot1 + geom_smooth(method="lm", se=FALSE) + theme_bw() +
  labs(x="NIKKEI", y="SP500")
grid.arrange(plot1, plot2, ncol=2)
[Figure: two scatter plots of SPret against NIKret — raw (left) and with a regression line, theme_bw(), and axis labels NIKKEI/SP500 (right)]
I The strength of ggplot2 is to make it easier to produce sophisticated graphics
I For example: if we want the dots in the scatter plot to depend on a variable (e.g., Year), this can be done easily by adding the argument color=Year in the aesthetics
GN.df$Year <- year(time(price))
plot1 <- ggplot(GN.df, aes(NIKret, SPret, color=Year)) + geom_point() +
  geom_vline(xintercept = 0) + geom_hline(yintercept = 0) + theme_bw() + labs(x="", y="")
plot2 <- ggplot(GN.df, aes(NIKret, SPret, color=factor(Year))) + geom_point() +
  geom_vline(xintercept = 0) + geom_hline(yintercept = 0) + theme_bw() + labs(x="", y="")
grid.arrange(plot1, plot2, ncol=2)
[Figure: the same scatter plot colored by Year as a continuous scale (left) and by factor(Year) with one color per year, 1990-2018 (right)]
Boxplot
I A boxplot provides a graphical representation of the distribution of the data
I Box-whisker plot: the box represents the 25%, 50%, and 75% quantiles, and the whiskers extend to the most extreme values within 1.5 times the interquartile range; the dots are the observations beyond the whiskers (outliers)
I In the graph below a boxplot is drawn for each year separately
ggplot(GN.df, aes(factor(Year), SPret)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90))
[Figure: boxplots of SPret by factor(Year), 1990-2018]
Summary statistics
I The function summary() provides a few summary statistics of the distribution of a data object
I Of course, the variable needs to be numeric
summary(GSPC$ret.simple)
     Index                ret.simple
 Min.   :1990-01-02   Min.   :-9.0350
 1st Qu.:1996-12-26   1st Qu.:-0.4399
 Median :2004-01-07   Median : 0.0530
 Mean   :2004-01-07   Mean   : 0.0353
 3rd Qu.:2011-01-13   3rd Qu.: 0.5541
 Max.   :2018-01-24   Max.   :11.5800
                      NA's   :1
summary(GSPC$ret.log)
     Index                 ret.log
 Min.   :1990-01-02   Min.   :-9.4695
 1st Qu.:1996-12-26   1st Qu.:-0.4409
 Median :2004-01-07   Median : 0.0530
 Mean   :2004-01-07   Mean   : 0.0292
 3rd Qu.:2011-01-13   3rd Qu.: 0.5525
 Max.   :2018-01-24   Max.   :10.9572
                      NA's   :1
I The package fBasics provides the basicStats() function with a more comprehensive set of descriptive statistics compared to summary()
fBasics::basicStats(GSPC$ret.log)
                ret.log
nobs        7072.000000
NAs            1.000000
Minimum       -9.469512
Maximum       10.957197
1. Quartile   -0.440919
3. Quartile    0.552538
Mean           0.029210
Median         0.052970
Sum          206.545022
SE Mean        0.013172
LCL Mean       0.003390
UCL Mean       0.055031
Variance       1.226761
Stdev          1.107592
Skewness      -0.252791
Kurtosis       9.004323
Covariance and correlation between two or more assets
I When we have several variables or assets, the first question that arises in the analysis is whether they co-move
I Dependence is measured using the covariance and the correlation
I Do the S&P 500 and NIKKEI move together?
Ret <- subset(GN.df, select=c("SPret","NIKret"))
cov(Ret, use='complete.obs')
        SPret NIKret
SPret  16.960 13.446
NIKret 13.446 38.731
cor(Ret, use='complete.obs')
         SPret  NIKret
SPret  1.00000 0.52463
NIKret 0.52463 1.00000
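The two measures are tied together: the correlation is the covariance scaled by the product of the standard deviations. A quick check on simulated data (the series below are synthetic, not the returns above):

```r
set.seed(1)
x <- rnorm(100)              # synthetic "asset 1" returns
y <- 0.5 * x + rnorm(100)    # synthetic "asset 2", built to co-move with x
cov(x, y) / (sd(x) * sd(y))  # identical to cor(x, y)
cor(x, y)
```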
Plotting the data distribution
I A very useful tool to explore the distribution of the data is the histogram, which is an estimator of the underlying (population) distribution of the data. It is useful to assess (visually) characteristics of the data such as normality, fat tails, and asymmetry
hist(GSPC$ret.log, breaks=50, xlab="", main="")       # base function
qplot(ret.log, data=GSPC, geom="histogram", bins=50)  # ggplot function
[Figure: histogram of ret.log with base hist() (left) and ggplot2 qplot() (right)]
I We can overlay a non-parametric density estimate on the histogram, which appears as a smooth line that goes through the histogram bars
# base function
hist(GSPC$ret.log, breaks=50, main="", xlab="Return", ylab="", prob=TRUE)
lines(density(GSPC$ret.log, na.rm=TRUE), col=2, lwd=2)
box()
# ggplot function
ggplot(GSPC, aes(ret.log)) +
  geom_histogram(aes(y = ..density..), bins=50, color="black", fill="white") +
  geom_density(color="red", size=1.2) +
  theme_bw()
(Figure: histograms of ret.log with a kernel density line overlaid — base R version with Return on the x-axis, ggplot version with density on the y-axis)
I Or compare the histogram to a distribution (e.g., normal)
# base function
hist(GSPC$ret.log, breaks=50, main="", xlab="Return", ylab="", prob=TRUE)
curve(dnorm(x, mean(GSPC$ret.log, na.rm=T), sd(GSPC$ret.log, na.rm=T)),
      from=-10, to=10, add=TRUE, col="red", lwd=2)
box()
# ggplot function
ggplot(GSPC, aes(ret.log)) +
  geom_histogram(aes(y = ..density..), bins=50, color="black", fill="white") +
  stat_function(fun = dnorm, colour = "red",
                args = list(mean(GSPC$ret.log, na.rm=T), sd(GSPC$ret.log, na.rm=T)),
                size=1.2) +
  theme_bw()
(Figure: histograms of ret.log with the fitted normal density overlaid — base R version with Return on the x-axis, ggplot version with density on the y-axis)
Dates and times in R
I We already used the command as.Date() to define a string to be of type Date
I The default format of a date in R is 2011-07-17
I If in your dataset a date is specified in a different way, you need to help R read it by specifying the format= argument
I The syntax of the format is: %d for numerical day, %a and %A for abbreviated/unabbreviated weekday, %m for numerical month, %b and %B for abbreviated/unabbreviated month, %y and %Y for 2/4-digit year
as.Date("2011-07-17")  # default, no need to specify format
as.Date("July 17, 2011", format="%B %d,%Y")
as.Date("Monday July 17, 2011", format="%A %B %d,%Y")
as.Date("17072011", format="%d%m%Y")
as.Date("11@17#07", format="%y@%d#%m")
[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17"
I One operation we might want to do with dates is to calculate the difference between two dates
I This can be done by subtracting two dates or using the difftime() function, which also allows specifying the unit of time
date1 <- as.Date("July 17, 2011", format="%B %d,%Y")
date2 <- Sys.Date()
date2 - date1
difftime(date2, date1, units="secs")
difftime(date2, date1, units="days")
difftime(date2, date1, units="weeks")
Time difference of 2384 days
Time difference of 205977600 secs
Time difference of 2384 days
Time difference of 340.57 weeks
Time
I In addition to the date, we might need to specify the time of the day
I This is useful when dealing with intra-day data, such as the FX data that we discussed earlier and shown below
data.hf <- data.table::fread("USDJPY-2016-12.csv",
                             col.names=c("Pair","Date","Bid","Ask"),
                             colClasses=c("character","character","numeric","numeric"),
                             data.table=FALSE, showProgress=FALSE)
      Pair                    Date    Bid    Ask
1  USD/JPY 20161201 00:00:00.041 114.68 114.69
2  USD/JPY 20161201 00:00:00.042 114.68 114.69
3  USD/JPY 20161201 00:00:00.186 114.68 114.69
4  USD/JPY 20161201 00:00:00.188 114.68 114.69
5  USD/JPY 20161201 00:00:00.189 114.69 114.70
6  USD/JPY 20161201 00:00:00.223 114.69 114.70
7  USD/JPY 20161201 00:00:00.343 114.69 114.70
8  USD/JPY 20161201 00:00:00.347 114.69 114.70
9  USD/JPY 20161201 00:00:00.403 114.69 114.69
10 USD/JPY 20161201 00:00:00.415 114.69 114.69
I To work with time we can use two functions:
I strptime()
I as.POSIXlt()
I Both require specifying the format of the date and time part
I The format of the time is: %H hour (out of 24), %M minute, %S seconds, and %OS fractional seconds
strptime("20161201 01:00", format="%Y%m%d %H:%M")
strptime("20161201 00:00:01", format="%Y%m%d %H:%M:%S")
strptime("20161201 00:00:00.041", format="%Y%m%d %H:%M:%OS")
as.POSIXlt("20161201 00:00:00.041", format="%Y%m%d %H:%M:%OS")
##
date1 <- as.POSIXlt("20161201 00:00:00.041", format="%Y%m%d %H:%M:%OS")
date2 <- strptime("20161201 01:15:00.041", format="%Y%m%d %H:%M:%OS")
date2 - date1
difftime(date2, date1, unit="secs")
[1] "2016-12-01 01:00:00 EST"
[1] "2016-12-01 00:00:01 EST"
[1] "2016-12-01 00:00:00 EST"
[1] "2016-12-01 00:00:00 EST"
Time difference of 1.25 hours
Time difference of 4500 secs
lubridate package
I This package makes it easier to define dates by providing dedicated functions that combine as.Date() and the format specification
I These functions are:
I ymd: for dates in the format year, month, day
I dmy: dates with day, month, year format
I mdy: when the format is month, day, year
I ymd_hm: in addition to the date, the time is provided in hour and minute (the date part can be changed to other formats)
I ymd_hms: the time format is hour, minute, and seconds
library(lubridate)
ymd("20110717")
ymd("2011/07/17")
ymd_hm("20110717 01:00")
ydm_hms("20111707 00:00:00.041")
[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17 01:00:00 UTC"
[1] "2011-07-17 00:00:00 UTC"
I The package provides functions that make it easy to extract the year, month, day, day of the week/month/year, minute, second, etc.
mydate <- ydm_hms("20111707 00:00:00.041")
year(mydate)
month(mydate)
day(mydate)
wday(mydate, label=T, abbr=FALSE)
minute(mydate)
second(mydate)
[1] 2011
[1] 7
[1] 17
[1] Sunday
Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday
[1] 0
[1] 0.041
dplyr package
I The package dplyr provides functions to manipulate data frames
I Data analysis requires a lot of data manipulation and this package has made the task significantly easier
I dplyr has 5 main verbs:
I mutate: to create new variables
I select: to select columns of the data frame
I filter: to select rows based on a criterion
I group_by()/summarize: uses a function to summarize columns/variables in one value
I arrange: to order a data frame based on one or more variables
I dplyr is typically used with the %>% piping operator (read "then")
I %>% is useful to write compact code when we are not interested in using or storing the intermediate results
mutate and select
I mutate() to create new variables
I select() to select existing variables
library(dplyr)
library(lubridate)
GSPC.df <- mutate(GSPC.df, range = 100 * log(High/Low),
                  ret.c2c = 100 * log(Adjusted / lag(Adjusted)),
                  year = year(Date),
                  month = month(Date),
                  wday = wday(Date, label=T, abbr=F))
tail(GSPC.df, 2)

           Date   Open   High    Low  Close     Volume Adjusted   range   ret.c2c year month      wday
8334 2018-01-23 2835.1 2842.2 2830.6 2839.1 3519650000   2839.1 0.41073  0.217200 2018     1   Tuesday
8335 2018-01-24 2845.4 2853.0 2824.8 2837.5 4014070000   2837.5 0.99195 -0.056013 2018     1 Wednesday

tail(select(GSPC.df, Date, year, month, wday, range, ret.c2c), 2)

           Date year month      wday   range   ret.c2c
8334 2018-01-23 2018     1   Tuesday 0.41073  0.217200
8335 2018-01-24 2018     1 Wednesday 0.99195 -0.056013
filter()
I the filter() function is used to select specific rows of the data frame
filter(GSPC.df, wday == "Tuesday") %>% head(7)
        Date   Open   High    Low  Close    Volume Adjusted   range   ret.c2c year month    wday
1 1985-01-08 164.24 164.59 163.91 163.99  92110000   163.99 0.41400 -0.152332 1985     1 Tuesday
2 1985-01-15 170.51 171.82 170.40 170.81 155300000   170.81 0.82988  0.175790 1985     1 Tuesday
3 1985-01-22 175.23 176.63 175.14 175.48 174800000   175.48 0.84715  0.142568 1985     1 Tuesday
4 1985-01-29 177.40 179.19 176.58 179.18 115700000   179.18 1.46727  0.998381 1985     1 Tuesday
5 1985-02-05 180.35 181.53 180.07 180.61 143900000   180.61 0.80753  0.144058 1985     2 Tuesday
6 1985-02-12 180.51 180.75 179.45 180.56 111100000   180.56 0.72182  0.027697 1985     2 Tuesday
7 1985-02-19 181.60 181.61 180.95 181.33  90400000   181.33 0.36408 -0.148791 1985     2 Tuesday
group_by()/summarize()
I A common operation in data analysis is to group observations based on a certain characteristic and apply a certain function to each group
I Example: calculate the average/min/max return by day of the week
GSPC.df %>% group_by(wday) %>%
  summarize(AV.RET = mean(ret.c2c, na.rm=T),
            MIN.RET = min(ret.c2c, na.rm=T),
            MAX.RET = max(ret.c2c, na.rm=T))
# A tibble: 5 x 4
  wday        AV.RET  MIN.RET MAX.RET
  <ord>        <dbl>    <dbl>   <dbl>
1 Monday    0.010565 -22.8997 10.9572
2 Tuesday   0.066759  -5.9108 10.2457
3 Wednesday 0.054635  -9.4695  8.7089
4 Thursday  0.016346  -7.9224  6.6923
5 Friday    0.019671  -7.0082  6.1328
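arrange(), the fifth verb listed earlier, has not been shown yet; a minimal sketch with a hypothetical toy data frame:

```r
library(dplyr)

# Hypothetical toy data frame of yearly returns
df <- data.frame(year = c(2017, 2015, 2016), ret = c(0.2, -0.1, 0.4))

arrange(df, year)        # sort rows ascending by year
arrange(df, desc(ret))   # sort rows descending by return
```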
I The grouping can also be done on two variables, for example month and year
GSPC.df %>% group_by(month, year) %>%
  summarize(AV.RET = mean(ret.c2c, na.rm=T),
            MIN.RET = min(ret.c2c, na.rm=T),
            MAX.RET = max(ret.c2c, na.rm=T))
# A tibble: 397 x 5
# Groups: month [?]
   month  year    AV.RET  MIN.RET MAX.RET
   <dbl> <dbl>     <dbl>    <dbl>   <dbl>
 1     1  1985  0.393875 -0.54228  2.2566
 2     1  1986  0.010744 -2.76472  1.4843
 3     1  1987  0.589429 -1.40073  2.3024
 4     1  1988  0.198181 -7.00824  3.5231
 5     1  1989  0.327143 -0.87157  1.4854
 6     1  1990 -0.324090 -2.61989  1.8710
 7     1  1991  0.184905 -1.74726  3.6642
 8     1  1992 -0.091477 -1.11960  1.4615
 9     1  1993  0.035106 -0.87605  0.8903
10     1  1994  0.152304 -0.58097  1.1363
# ... with 387 more rows
Does volatility of the S&P 500 vary over time?
I dplyr functions can be very useful to write relatively lengthy operations in a compact and readable manner
I Assume we want to calculate the average volatility by year using the intra-day range as a proxy for volatility
GSPC.df %>% mutate(range = 100 * log(High/Low),
                   year = year(Date)) %>%
  group_by(year) %>%
  summarize(av.range = mean(range, na.rm=T)) %>%
  ggplot(., aes(year, av.range)) + geom_line(color="steelblue4", size=1.3) +
  theme_bw() + labs(x="", y="")
(Figure: line plot of the average intra-day range of the S&P 500 by year)
Creating functions in R
I The advantage of a programming language is that you have the flexibility to write your own functions. This can be useful when:
1. No package provides pre-programmed functions to perform the analysis you want to conduct
2. The task is very complex and you prefer to break it down into smaller tasks that make the code easier to read, interpret, and test
3. Once you write a function, you can use it again in future analyses
I A function is a set of operations applied to some data
I The syntax is as follows:
myfunction <- function(inputs){
  ## operations
  return(output)
}
I The function() can include several arguments
I The return() is the output of the function and can include only one object, although that object might include several elements
I To call the function you type in R the name of the function with the appropriate arguments: myfunction()
A function to calculate the sample average
I R provides the mean() function to calculate the sample average
mean(GSPC$ret.log, na.rm=T)
[1] 0.02921
I As an illustration, let’s write a function that calculates the sample average: R̄ = (1/T) ∑_{t=1}^{T} R_t
I In this case the set of operations to perform is quite simple:
1. sum the values of the time series
2. divide by the total number of observations
I Below is the code that defines a new function called mymean()
# Y is the input, Ybar the output of the function
mymean <- function(Y){
  Y = na.omit(Y)
  Ybar <- sum(Y) / length(Y)
  return(Ybar)
}
I Let’s compare the results:
mean(GSPC$ret.log, na.rm=T)
[1] 0.02921
mymean(GSPC$ret.log)
[1] 0.02921
Loops in R
I Loops are a useful tool when you want to perform the same set of operations on several time series or datasets
I A common loop is the for loop which has the following syntax:
for (i in 1:N){
  # write your commands here
}
I The loop iterates through the values of i from 1 to N
I Example:
for (i in 1:3){
  print(i)
}
[1] 1
[1] 2
[1] 3
mysum() function using a for loop
mysum <- function(Y){
  Y = na.omit(Y)
  N = length(Y)   # define N as the number of elements of Y
  sumY = 0        # initialize the variable that will store the sum of Y
  for (i in 1:N){
    sumY = sumY + as.numeric(Y[i])  # current sum is equal to previous sum
  }                                 # plus the i-th value of Y
  return(sumY)    # as.numeric(): makes sure to transform
}                 # from other classes to a number
mysum(GSPC$ret.log)
[1] 206.55
sum(GSPC$ret.log, na.rm=T)
[1] 206.55
A simulation exercise
I Simulations are used to evaluate some quantities (e.g., the price of an option or an estimator) based on a large number of samples generated from a certain distribution
I The recipe works as follows:
1. generate random values from a model
2. calculate the quantity of interest
3. repeat 1 and 2 many times
I Example: we want to evaluate if the distribution of the sample mean is N(µ, σ²/N), where µ is the population mean, σ² the population variance, and N the sample size
I In the example on the following slide we generate data from a normal distribution with mean 0 and standard deviation 2
S = 5000    # set the number of simulations
N = 1000    # set the length of the sample
mu = 0      # population mean
sigma = 2   # population standard deviation

Ybar = vector('numeric', S)   # create an empty vector of S elements
                              # to store the sample mean of each simulation
for (i in 1:S){
  Y = rnorm(N, mu, sigma)     # generate a sample of length N
  Ybar[i] = mean(Y)           # store the sample mean
}

ggplot(data = data.frame(Ybar=Ybar), aes(Ybar)) +
  geom_histogram(aes(y = ..density..), color="red", fill="lightsalmon", bins = 40) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = sigma/sqrt(N)),
                color="seagreen", size=1.2) +
  theme_bw() + xlim(c(-0.3, 0.3))
(Figure: histogram of the simulated sample means Ybar with the N(0, σ²/N) density overlaid)