Fitting and Predicting a Time Series Model (Midterm 2)
April 11, 2013
Sueja Goldhahn, 23185225, [email protected]
Prof. Guntuboyina, Intro to Time Series Analysis



Summary

As a take-home midterm assignment, the task is to fit the best possible model to a given time series dataset and predict the outcome for the next year. The dataset was chosen from a group of 5 datasets given for the assignment. Nothing is known about the data other than that it is weekly data from Google Trends, obtained on March 20, 2013. The dataset consists of 429 data points, spanning 8 years and 13 weeks from the week of January 4, 2004 to March 24, 2012, as seen in Figure 1. The data for the next year, from March 25, 2012 to March 23, 2013, are to be predicted.

After analyzing the data closely, the model chosen as the best fit is a multiplicative seasonal autoregressive integrated moving average model, ARIMA(1,1,1)X(0,1,3) with period 52. Using this model, the resulting prediction for the next year is given in Figures 11 and 12.

This report will describe the methods used to choose the best model for the time series dataset, explain why I believe this model is the best fit to the dataset, and then explain the techniques used to predict the next year.


Method Used to Fit the Model

The first things to observe in the data are trend and variability. It is very apparent that there is a quadratic trend, along with seasonality. The variability in the data stays roughly constant throughout, indicating that a transformation of the data is not necessary. This suggests that the data needs to be differenced once to eliminate the quadratic trend, and then differenced again to eliminate the seasonality; the second difference is taken at lag 52, the number of weeks in a year.

After differencing, the trend should be eliminated and the data should look like white noise. The differenced data is displayed in Figure 2; the plot visibly has no remaining structure. To check that the orders of differencing are optimal, the standard deviation of the differenced data should be considered: correctly differenced data should have a small standard deviation. A table of standard deviations for each order of differencing is displayed in Figure 3, showing that the standard deviation is indeed smallest for this combination of differencing orders.

Figure 3

Standard Deviation of Differenced Data

                    Lag-1, order 0   Lag-1, order 1   Lag-1, order 2
Lag-52, order 0          6.782            3.556            6.387
Lag-52, order 1          3.895            2.145            9.936
Lag-52, order 2          4.051            3.289           17.475
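The report does not show the code that produced this table, so the following is only an illustrative sketch of how such a table can be computed, written in Python; the helpers diff and sd_table are hypothetical names, and the data in the usage note is synthetic, not the report's dataset.

```python
import numpy as np

def diff(x, lag=1, times=1):
    """Apply a lag-`lag` difference `times` times (mirrors R's diff())."""
    for _ in range(times):
        x = x[lag:] - x[:-lag]
    return x

def sd_table(x, period=52, max_order=2):
    """Sample standard deviation of the series after every combination of
    regular (lag-1) and seasonal (lag-`period`) differencing orders."""
    table = {}
    for d in range(max_order + 1):          # regular differencing order
        for D in range(max_order + 1):      # seasonal differencing order
            y = diff(np.asarray(x, float), lag=1, times=d)
            y = diff(y, lag=period, times=D)
            table[(d, D)] = float(np.std(y, ddof=1))
    return table
```

On a synthetic weekly series with a quadratic trend and period-52 seasonality, the combination with the smallest standard deviation is (1, 1): one regular and one seasonal difference, matching the choice above.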

Once the correct order of differencing is attained, the next step is to look at the autocorrelation and partial autocorrelation functions to determine the autoregressive and moving average terms. The autocorrelation function is displayed in Figure 4, and the partial autocorrelation function in Figure 5.


Figure 4

Figure 5

Examining the autocorrelation function gives several clues about what model to fit. The first thing to note is that the model has a seasonal MA term, indicated by the negative spike at lag 52. The lag-104 autocorrelation is also significant, with a value of 0.109, lying outside the confidence bands, and there is some asymmetry around these seasonal lags. Hence, the two seasonal MA orders to consider are SMA(2) and SMA(3).


The next thing to note is that the lag-1 autocorrelation is significant and the autocorrelations cut off after lag 1, which is characteristic of an MA model. It is also possible that an AR term belongs in the equation, given the shrinking variability in the autocorrelations as the lags increase.

The partial autocorrelation function shows characteristics of an MA model as a result of its slow decay. From these observations, a few good models to test are:

ARIMA(0,1,1)X(0,1,3)
ARIMA(0,1,1)X(0,1,2)
ARIMA(1,1,1)X(0,1,3)
ARIMA(1,1,2)X(0,1,3)

After fitting the data to each of these models, I chose the two best fits based on their AIC scores (refer to the appendix under "Results of Each Model" for the full results):

ARIMA(1,1,1)X(0,1,3), AIC = 1504.52
ARIMA(1,1,2)X(0,1,3), AIC = 1504.34

To see how closely the theoretical autocorrelation function of each model matches the sample autocorrelations, I plotted the two together in Figures 6-9. This shows how well the models fit the data and whether the estimated phi and theta values are any good. For the phi and theta values used, as well as the code, please refer to the appendix under "Results of Each Model."
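In the appendix, the theta vector passed to ARMAacf is built by hand: the nonseasonal and seasonal MA polynomials are multiplied out term by term. The same expansion can be written generically; this is an illustrative sketch in Python, where seasonal_ma_coeffs is a hypothetical helper, not part of the original code.

```python
import numpy as np

def seasonal_ma_coeffs(ma, sma, period=52):
    """Expand theta(B) * Theta(B^period), each written as 1 + coefficients,
    into one ordinary MA coefficient vector for lags 1..q via polynomial
    multiplication (convolution). Returns coefficients without the leading 1."""
    a = np.zeros(len(ma) + 1)
    a[0] = 1.0
    a[1:] = ma                       # nonseasonal polynomial 1 + theta_1 B + ...
    b = np.zeros(period * len(sma) + 1)
    b[0] = 1.0
    for j, s in enumerate(sma, start=1):
        b[j * period] = s            # seasonal polynomial 1 + Theta_1 B^s + ...
    full = np.convolve(a, b)         # product polynomial
    return full[1:]                  # drop the constant term
```

With the fitted ARIMA(1,1,1)X(0,1,3) values, this reproduces the hand-built vector: nonzero entries at lags 1, 52, 53, 104, 105, 156, and 157, with the cross terms equal to the products of the two coefficients.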


Figure 6

Figure 7


Figure 8

Figure 9

The blue points are the autocorrelations computed from the data, and the red points are the theoretical autocorrelations. Both models capture the structure of the autocorrelations, which tells me that the chosen models are a good fit. For the second model, which has the smaller AIC, the theoretical autocorrelations appear to fit slightly better than for the first.


Choosing the Best Model for Predicting

To determine which of the two competing models to use, they must be tested for how well they predict the data. This is done through cross-validation. I will first explain the methodology used in cross-validating the data, and then show the resulting cross-validation scores.

Given that there are 8 years and 13 weeks in the dataset, cross-validation is done by predicting values of held-out years and taking the sum of squared errors of the predictions. It is optimal to predict as many years within the data as possible, but the amount of data required to fit the model limits the number of years that can be predicted. Obviously, the first year of data cannot be predicted, as there are no earlier data points to predict from.

The minimum amount of data needed to fit the model is about 4 years, so I used data points 1 through 221 to predict points 222 to 273, then points 1 through 273 to predict points 274 to 325, and so on. The accuracy of the predictions is measured by comparing the predicted data to the actual data: the differences are squared and summed. This is done for each of the 4 predicted years, and the results are averaged to obtain the cross-validation (CV) score. The model with the smaller CV score is the better model for the dataset, and will be used to predict the next year's data.
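The expanding-window scheme described above can be sketched language-agnostically as follows; this is illustrative Python, where fit_predict is a hypothetical callable standing in for the fit-and-forecast step (in the appendix, an arima fit followed by predict).

```python
import numpy as np

def rolling_origin_cv(series, fit_predict, initial=221, horizon=52, folds=4):
    """Expanding-window cross-validation: fit on series[:k], forecast the
    next `horizon` points, and average the per-fold sums of squared errors.
    Defaults mirror the report: 221 initial points, 52-week horizon, 4 folds."""
    series = np.asarray(series, float)
    sse = []
    for i in range(folds):
        k = initial + i * horizon
        train = series[:k]
        actual = series[k:k + horizon]
        pred = np.asarray(fit_predict(train, horizon), float)
        sse.append(float(np.sum((actual - pred) ** 2)))
    return float(np.mean(sse))
```

Passing a naive last-value forecaster in place of fit_predict is a quick sanity check that the train/test bookkeeping matches the fold boundaries described above.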

The CV score results are shown in Figures 10 and 11, along with plots of the actual data (in black) and the predicted data (in red). It is evident that both models predict the data accurately. Although the second model has a slightly


better AIC score, its CV score is significantly larger than that of the first model. Therefore, the best model for this dataset is the first model, ARIMA(1,1,1)X(0,1,3). This model will be used to forecast the next year's data in the following section.

Figure 10

 ARIMA(1,1,1)X(0,1,3)

CV = 479.9672

Figure 11

 ARIMA(1,1,2)X(0,1,3)

CV = 495.4493


The Forecast

Now that the best model for the dataset has been chosen, the next year is forecasted using it. The data, including the forecast, are displayed in Figure 11. The forecasted data look visually plausible. The 95% confidence interval of the forecast is exhibited in Figure 12.

Figure 11

Figure 12
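The confidence band in Figure 12 is formed as the point forecast plus or minus twice the forecast standard error; a minimal sketch of that arithmetic, in illustrative Python with a hypothetical forecast_band helper:

```python
import numpy as np

def forecast_band(pred, se, z=2.0):
    """Approximate 95% prediction interval as pred +/- z * se
    (z = 2 is used in the appendix, close to the Gaussian 1.96)."""
    pred = np.asarray(pred, float)
    se = np.asarray(se, float)
    return pred - z * se, pred + z * se
```

The same two lines appear in the appendix as U = d1fc$pred + 2*d1fc$se and L = d1fc$pred - 2*d1fc$se.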


# APPENDIX

# data: d1 #

plot(d1, type = "l", main = "Time Series Data", ylab = "", xlab = "Weeks",
     sub = "January 04, 2004 to March 24, 2012")

# Difference for seasonality and trend
d = diff(d1, lag = 52)
d = diff(d)

# Check to see that there is no trend left in the model
plot(d, type = "l", main = "Differenced Data", ylab = "", xlab = "")

# Check acf and pacf
acf(d, lag.max = 200, main = "Autocorrelation Function")
pacf(d, lag.max = 200, main = "Partial Autocorrelation Function")

# Results of Each Model:

# ARIMA(0,1,1)X(0,1,3)
arima(d1, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 3), period = 52))
# aic = 1508.32
#          ma1     sma1    sma2    sma3
#      -0.6419  -0.3395  0.2599  0.1885
# s.e.  0.0515   0.0610  0.0750  0.0770

# ARIMA(0,1,1)X(0,1,2)
arima(d1, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 2), period = 52))
# aic = 1512.92
#          ma1     sma1    sma2
#      -0.6290  -0.3430  0.2827
# s.e.  0.0513   0.0653  0.0748

# ARIMA(1,1,1)X(0,1,3)
arima(d1, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 3), period = 52))
# aic = 1504.52
#          ar1      ma1     sma1    sma2    sma3
#       0.2149  -0.7890  -0.3429  0.2711  0.1873
# s.e.  0.0815   0.0547   0.0604  0.0766  0.0772

# ARIMA(1,1,2)X(0,1,3)
arima(d1, order = c(1, 1, 2), seasonal = list(order = c(0, 1, 3), period = 52))
# aic = 1504.34
#          ar1      ma1     ma2     sma1    sma2    sma3
#       0.6451  -1.2347  0.3173  -0.3372  0.2759  0.1881
# s.e.  0.1904   0.2061  0.1557   0.0603  0.0758  0.0771


# Theoretical Autocorrelation of ARIMA(1,1,1)X(0,1,3)
# phi and theta values (the seasonal and nonseasonal MA polynomials multiplied out;
# the gaps between the cross terms are 50 zeros so the seasonal terms land on
# lags 52/53, 104/105, and 156/157):
ph = 0.2149
th = c(-0.789, rep(0, 50), -0.3429, -0.3429 * -0.789,
       rep(0, 50), 0.2711, 0.2711 * -0.789,
       rep(0, 50), 0.1873, 0.1873 * -0.789)

acf(d, lag.max = 175, main = "Theoretical Autocorrelation of ARIMA(1,1,1)X(0,1,3)", col = "blue")
ACF = ARMAacf(ar = ph, ma = th, lag.max = 175)
points(x = 0:175, y = ACF, col = "red", type = "h")

pacf(d, lag.max = 175, main = "Theoretical Partial Autocorrelation of ARIMA(1,1,1)X(0,1,3)", col = "blue")
PACF = ARMAacf(ar = ph, ma = th, lag.max = 175, pacf = T)
points(x = 1:175, y = PACF, col = "red", type = "h")

# Theoretical Autocorrelation of ARIMA(1,1,2)X(0,1,3)
# phi and theta values:
ph = 0.6451
th = c(-1.2347, 0.3173, rep(0, 49),
       -0.3372, -0.3372 * -1.2347, -0.3372 * 0.3173, rep(0, 49),
       0.2759, 0.2759 * -1.2347, 0.2759 * 0.3173, rep(0, 49),
       0.1881, 0.1881 * -1.2347, 0.1881 * 0.3173)

acf(d, lag.max = 175, main = "Theoretical Autocorrelation of ARIMA(1,1,2)X(0,1,3)", col = "blue")
ACF = ARMAacf(ar = ph, ma = th, lag.max = 175)
points(x = 0:175, y = ACF, col = "red", type = "h")

pacf(d, lag.max = 175, main = "Theoretical Partial Autocorrelation of ARIMA(1,1,2)X(0,1,3)", col = "blue")
PACF = ARMAacf(ar = ph, ma = th, lag.max = 175, pacf = T)
points(x = 1:175, y = PACF, col = "red", type = "h")

# Cross-validation

# model 1: ARIMA(1,1,1)X(0,1,3)
pred = rep(0, 208)
CV = rep(0, 4)
for (i in 0:3) {
  k = 221 + i * 52
  nd1 = d1[1:k]
  d1fit = arima(nd1, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 3), period = 52))
  d1fc = predict(d1fit, n.ahead = 52)
  pred[(1 + i * 52):((i + 1) * 52)] = as.numeric(d1fc$pred)
  CV[i + 1] = sum((d1[(k + 1):(221 + (i + 1) * 52)] - as.numeric(d1fc$pred))^2)
}
mean(CV)


plot(c(222:429), d1[222:429], type = "l",
     main = "Comparison of Prediction to Actual Data", xlab = "Time", ylab = "")
points(c(222:429), pred, col = "red", type = "l")

# model 2: ARIMA(1,1,2)X(0,1,3)
pred = rep(0, 208)
CV = rep(0, 4)
for (i in 0:3) {
  k = 221 + i * 52
  nd1 = d1[1:k]
  d1fit = arima(nd1, order = c(1, 1, 2), seasonal = list(order = c(0, 1, 3), period = 52))
  d1fc = predict(d1fit, n.ahead = 52)
  pred[(1 + i * 52):((i + 1) * 52)] = as.numeric(d1fc$pred)
  CV[i + 1] = sum((d1[(k + 1):(221 + (i + 1) * 52)] - as.numeric(d1fc$pred))^2)
}
mean(CV)

plot(c(222:429), d1[222:429], type = "l",
     main = "Comparison of Prediction to Actual Data", xlab = "Time", ylab = "")
points(c(222:429), pred, col = "red", type = "l")

# The Forecast Using Model ARIMA(1,1,1)X(0,1,3)
d1fit = arima(d1, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 3), period = 52))
d1fc = predict(d1fit, n.ahead = 52)

# 95% prediction bands
U = d1fc$pred + 2 * d1fc$se
L = d1fc$pred - 2 * d1fc$se

newx = 1:481
newy = c(d1, d1fc$pred)
plot(newx, newy, type = "l", main = "Data Including the Forecast",
     xlab = "Weeks", ylab = "",
     sub = "January 04, 2004 to March 23, 2013")

plot(430:481, d1fc$pred, type = "l", ylim = c(60, 110),
     main = "Forecast and 95% Confidence Interval for the Next Year",
     xlab = "Weeks", ylab = "",
     sub = "March 25, 2012 to March 23, 2013")
points(newx[430:481], U, col = "blue", type = "l")
points(newx[430:481], L, col = "blue", type = "l")