Day 1 Part 4: ARIMA Modelling
Load Libraries
Load the libraries needed for data manipulation, financial data access, and time-series analysis. These packages let us handle time series data, fit ARIMA models, and visualize the results effectively.
library(tidyverse)
library(tidyfinance)
library(tsibble)
library(fable)
library(feasts)
library(scales)
library(fabletools)
ARIMA Processes
ARIMA (AutoRegressive Integrated Moving Average) processes are commonly used in time series analysis to model and predict future values by combining autoregressive (AR), moving average (MA), and differencing components. The AR part captures the influence of past values, the MA part captures the influence of past forecast errors, and differencing ensures the series is stationary. ARIMA models are particularly useful for understanding trends and cyclic behaviors in time series data.
An autoregressive model of order p, denoted as AR(p), is written as:
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t,
where \varepsilon_t is white noise. This structure resembles a multiple regression with lagged values of y_t as predictors.
In contrast, a moving average model of order q, denoted MA(q), uses past forecast errors as predictors:
y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q},
where \varepsilon_t is white noise. Since we do not directly observe the values of \varepsilon_t, it’s not a true regression in the usual sense.
When differencing is combined with AR and MA components, we get a non-seasonal ARIMA model, where “integration” refers to the differencing process. The full model is written as:
y'_t = c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t,
where y'_t is the differenced series. This is an ARIMA(p, d, q) model, where:
- p is the order of the autoregressive part,
- d is the degree of differencing required,
- q is the order of the moving average part.
The ARIMA model framework is powerful for capturing both trend and cyclical behaviors in time series data, making it a fundamental tool in forecasting.
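Before building a simulation by hand in the next section, the notation can be connected to code with base R's stats::arima.sim(), which simulates from a specified ARIMA model. A minimal sketch (the parameter values and seed are illustrative, not part of the original material):
# Sketch: draw 100 observations from an ARMA(1,1) with phi_1 = 0.9 and
# theta_1 = 0.9; arima.sim() handles the burn-in period automatically
set.seed(123)
y_sim <- arima.sim(model = list(ar = 0.9, ma = 0.9), n = 100)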
Simulate ARIMA Processes
Here we simulate ARIMA processes to illustrate their characteristics.
set.seed(42)

n <- 10^2

# Simulate n standard-normal innovations together with their first lag
innovations <-
  tibble(
    time = 1:n,
    e = rnorm(n),
    e_lag = lag(e, default = 0)
  ) %>%
  as_tsibble(index = time)

innovations
# A tsibble: 100 x 3 [1]
time e e_lag
<int> <dbl> <dbl>
1 1 1.37 0
2 2 -0.565 1.37
3 3 0.363 -0.565
4 4 0.633 0.363
5 5 0.404 0.633
6 6 -0.106 0.404
7 7 1.51 -0.106
8 8 -0.0947 1.51
9 9 2.02 -0.0947
10 10 -0.0627 2.02
# ℹ 90 more rows
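As a quick sanity check (not part of the original code), a Ljung-Box test on the innovations should fail to reject the white-noise null, confirming that e carries no serial dependence:
innovations %>% features(e, ljung_box, lag = 12)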
Plot Innovations
Visualize the random innovations generated for the ARIMA processes. This helps us understand the underlying noise component that drives the stochastic processes in ARIMA models.
innovations %>% autoplot(e)

innovations %>%
  ggplot(aes(e)) +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, color = "red")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Generate ARIMA Processes
Generate different ARIMA processes using the simulated innovations. We create a random walk together with ARMA(1,0), ARMA(0,1), and ARMA(1,1) processes to illustrate how varying the parameters affects the behavior of the time series.
arma_processes <-
  innovations %>%
  mutate(
    # random walk: y_t = y_{t-1} + e_t
    random_walk = accumulate(
      e, (\(y_lag, e) y_lag + e)
    ),
    # AR(1): y_t = 0.9 y_{t-1} + e_t
    arma10 = accumulate(
      e, (\(y_lag, e) 0.9 * y_lag + e)
    ),
    # MA(1): y_t = -0.9 e_{t-1} + e_t
    arma01 = -0.9 * e_lag + e,
    # ARMA(1,1): y_t = 0.9 y_{t-1} + 0.9 e_{t-1} + e_t; [-1] drops the .init value
    arma11 = accumulate2(
      e,
      e_lag,
      .f = (\(y_lag, e, e_lag) 0.9 * y_lag + 0.9 * e_lag + e),
      .init = 0
    )[-1]
  )
arma_processes
# A tsibble: 100 x 7 [1]
time e e_lag random_walk arma10 arma01 arma11
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.37 0 1.37 1.37 1.37 1.37
2 2 -0.565 1.37 0.806 0.669 -1.80 1.90
3 3 0.363 -0.565 1.17 0.965 0.871 1.57
4 4 0.633 0.363 1.80 1.50 0.306 2.37
5 5 0.404 0.633 2.21 1.76 -0.165 3.11
6 6 -0.106 0.404 2.10 1.47 -0.470 3.05
7 7 1.51 -0.106 3.61 2.84 1.61 4.16
8 8 -0.0947 1.51 3.52 2.46 -1.46 5.01
9 9 2.02 -0.0947 5.54 4.23 2.10 6.45
10 10 -0.0627 2.02 5.47 3.75 -1.88 7.56
# ℹ 90 more rows
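To compare the persistence of the four processes at a glance, they can be stacked and plotted together. A sketch using tidyr's pivot_longer(), which tsibble supports:
# Sketch: reshape the four simulated series into long format and plot them
arma_processes %>%
  pivot_longer(
    c(random_walk, arma10, arma01, arma11),
    names_to = "process"
  ) %>%
  autoplot(value)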
Display ARIMA Components
Visualize the generated ARIMA components using their autocorrelation functions (ACF) and partial autocorrelation functions (PACF). This helps us understand the persistence, memory, and dependence structures within each process.
In time series analysis, autocorrelation quantifies the relationship between values of a series at different time lags, offering insights into how past values influence future values within the same series. Unlike simple correlation, which measures the linear relationship between two distinct variables, autocorrelation assesses dependencies within a time series by examining lagged values.
Each lag in a time series has an associated autocorrelation coefficient, with r_1 representing the correlation between y_t and y_{t-1}, r_2 capturing the relationship between y_t and y_{t-2}, and so forth. The formula to calculate the autocorrelation at lag k is:
r_k = \frac{\sum_{t=k+1}^T (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^T (y_t - \bar{y})^2},
where T denotes the length of the time series, and \bar{y} is the mean of the series. The set of autocorrelation coefficients for various lags forms the autocorrelation function (ACF), which helps identify patterns and dependencies in the data.
However, an ACF plot can sometimes be misleading. For example, if y_t and y_{t-1} are correlated, then y_t and y_{t-2} may also appear correlated—not due to new information in y_{t-2}, but simply because both are related to y_{t-1}. This can create a chain of dependencies that may not accurately reflect the unique influence of each lag.
To address this issue, we use partial autocorrelations, which measure the relationship between y_t and y_{t-k} after controlling for the influence of intermediate lags (1 through k-1). The first partial autocorrelation is the same as the first autocorrelation, as there are no intervening lags to account for. Each subsequent partial autocorrelation isolates the direct effect of that specific lag, removing the influence of shorter lags. In an autoregressive model, each partial autocorrelation can be estimated as the final coefficient in an AR model of that order. Specifically, the kth partial autocorrelation, \alpha_k, corresponds to the coefficient \phi_k in an AR(k) model.
This refined approach gives us the partial autocorrelation function (PACF), which provides a clearer picture of the direct relationships between values at specific lags, making it a valuable tool for model selection and interpretation in time series analysis.
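The formula for r_k can be verified directly. A sketch (the helper acf_manual() is illustrative, not from the original) computes the lag-1 autocorrelation of the AR(1) series by hand and compares it with the value reported by feasts::ACF():
# Sketch: lag-k autocorrelation following the formula above
acf_manual <- function(y, k) {
  y_bar <- mean(y)
  t_max <- length(y)
  sum((y[(k + 1):t_max] - y_bar) * (y[1:(t_max - k)] - y_bar)) /
    sum((y - y_bar)^2)
}
acf_manual(arma_processes$arma10, k = 1)

arma_processes %>% ACF(arma10, lag_max = 1)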
arma_processes %>%
  gg_tsdisplay(
    arma10,
    plot_type = "partial",
    lag_max = 12
  )

arma_processes %>%
  gg_tsdisplay(
    arma01,
    plot_type = "partial",
    lag_max = 12
  )

arma_processes %>%
  gg_tsdisplay(
    arma11,
    plot_type = "partial",
    lag_max = 12
  )
For data following an ARIMA model with either AR terms (ARIMA(p, d, 0)) or MA terms (ARIMA(0, d, q)), the ACF and PACF plots can assist in identifying the appropriate values of p or q. When both parameters are positive, however, the plots may not distinctly guide the selection of these values.
To identify an ARIMA(p, d, 0) model, check the ACF and PACF plots of the differenced series for these patterns:
- The ACF shows an exponentially decaying or sinusoidal trend.
- The PACF displays a significant spike at lag p, with minimal activity beyond this point.
In contrast, data may fit an ARIMA(0, d, q) model if the ACF and PACF of the differenced series reveal:
- An exponentially decaying or sinusoidal pattern in the PACF.
- A distinct spike at lag q in the ACF, with no significant lags thereafter.
These patterns help differentiate between autoregressive and moving average components in the time series, making model identification clearer.
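When the plots are ambiguous, one pragmatic option (a sketch, not from the original) is to fit several candidate specifications to the same series and compare their information criteria:
# Sketch: candidate models for the simulated AR(1) series; the model names
# are illustrative
arma_processes %>%
  model(
    ar1 = ARIMA(arma10 ~ pdq(1, 0, 0)),
    ma1 = ARIMA(arma10 ~ pdq(0, 0, 1)),
    arma11 = ARIMA(arma10 ~ pdq(1, 0, 1))
  ) %>%
  glance() %>%
  select(.model, AIC, AICc, BIC)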
S&P 500 Daily Returns
Download and analyze S&P 500 daily returns. We use the S&P 500 index to study real-world time series data and to see the practical application of ARIMA modeling.
sp500_returns <-
  download_data(
    "stock_prices",
    symbols = "^GSPC",
    start = "2010-01-01",
    end = "2019-12-31"
  ) %>%
  rename(price = adjusted_close) %>%
  select(symbol, date, price) %>%
  mutate(
    # replace calendar dates with a regular integer index so the tsibble
    # has no gaps from weekends and holidays
    date = row_number(),
    lprice = log(price),
    lreturn = difference(lprice)
  ) %>%
  remove_missing() %>%
  as_tsibble(index = date) %>%
  glimpse()
Warning: Removed 1 row containing missing values or values outside the scale
range.
Rows: 2,514
Columns: 5
$ symbol <chr> "^GSPC", "^GSPC", "^GSPC", "^GSPC", "^GSPC", "^GSPC", "^GSPC",…
$ date <int> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19…
$ price <dbl> 1136.52, 1137.14, 1141.69, 1144.98, 1146.98, 1136.22, 1145.68,…
$ lprice <dbl> 7.035726, 7.036272, 7.040265, 7.043142, 7.044888, 7.035462, 7.…
$ lreturn <dbl> 0.0031108320, 0.0005453719, 0.0039932184, 0.0028775831, 0.0017…
Explore Log Prices
Explore and visualize the log-transformed prices of the S&P 500. The log transformation helps stabilize variance and allows us to focus on percentage changes.
sp500_returns %>%
  gg_tsdisplay(
    lprice,
    plot_type = "partial",
    lag_max = 12
  )

sp500_returns %>%
  autoplot(.vars = lprice)
Perform Stationarity Tests
Conduct stationarity tests on the log prices to determine if differencing is required. Stationarity is crucial in ARIMA modeling as non-stationary data can lead to misleading results.
sp500_returns %>%
  features(lprice, unitroot_kpss)
# A tibble: 1 × 2
kpss_stat kpss_pvalue
<dbl> <dbl>
1 27.2 0.01
sp500_returns %>%
  features(lprice, unitroot_ndiffs)
# A tibble: 1 × 1
ndiffs
<int>
1 1
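As a follow-up check (not shown in the original output), the same KPSS test can be applied to the differenced series; after one difference the statistic should no longer reject stationarity:
sp500_returns %>%
  features(lreturn, unitroot_kpss)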
Explore Log Returns
Visualize the log returns of the S&P 500 and conduct further tests.
sp500_returns %>%
  autoplot(.vars = lreturn)

sp500_returns %>%
  features(lreturn, ljung_box, lag = 12)
# A tibble: 1 × 2
lb_stat lb_pvalue
<dbl> <dbl>
1 29.8 0.00300
sp500_returns %>%
  gg_tsdisplay(lreturn, plot_type = "partial", lag_max = 12)
Fit ARIMA Models
Fit manual and automatic ARIMA models to the S&P 500 log returns. This step demonstrates how to specify ARIMA models manually and automatically using R’s modeling capabilities.
Manual ARIMA Model
Fit a manual ARIMA model with specified parameters. This approach allows us to control the AR and MA components directly, providing insights into model specification and parameter estimation.
model_manual <-
  sp500_returns %>%
  remove_missing() %>%
  model(arima = ARIMA(lreturn ~ pdq(5, 0, 1)))

model_manual %>% report()
Series: lreturn
Model: ARIMA(5,0,1) w/ mean
Coefficients:
ar1 ar2 ar3 ar4 ar5 ma1 constant
-0.0228 0.0034 -0.0366 -0.0146 -0.0786 -0.0246 5e-04
s.e. 0.2172 0.0222 0.0199 0.0214 0.0200 0.2176 2e-04
sigma^2 estimated as 8.619e-05: log likelihood=8200.41
AIC=-16384.82 AICc=-16384.76 BIC=-16338.19
model_manual %>% gg_tsresiduals(lag_max = 12)
Automatic ARIMA Model
Fit an automatic ARIMA model using R’s built-in selection algorithms. This approach finds the best ARIMA model by optimizing information criteria, making it useful for practical applications.
model_auto <-
  sp500_returns %>%
  remove_missing() %>%
  model(arima = ARIMA(lreturn))

model_auto %>% report() %>% coef()
Series: lreturn
Model: ARIMA(2,0,2) w/ mean
Coefficients:
ar1 ar2 ma1 ma2 constant
0.1606 0.7838 -0.2092 -0.7678 0
s.e. 0.1222 0.1156 0.1301 0.1266 0
sigma^2 estimated as 8.626e-05: log likelihood=8198.37
AIC=-16384.75 AICc=-16384.72 BIC=-16349.77
# A tibble: 5 × 6
.model term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 arima ar1 0.161 0.122 1.31 1.89e- 1
2 arima ar2 0.784 0.116 6.78 1.48e-11
3 arima ma1 -0.209 0.130 -1.61 1.08e- 1
4 arima ma2 -0.768 0.127 -6.06 1.53e- 9
5 arima constant 0.0000231 0.00000447 5.16 2.71e- 7
model_auto %>% gg_tsresiduals(lag_max = 12)
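A sketch (not in the original) that puts the manual and automatic specifications side by side in a single mable, so their information criteria can be compared directly:
sp500_returns %>%
  remove_missing() %>%
  model(
    manual = ARIMA(lreturn ~ pdq(5, 0, 1)),
    auto = ARIMA(lreturn)
  ) %>%
  glance() %>%
  select(.model, AICc, BIC)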
Compare Data and Fitted Values
Compare the actual log returns with the fitted values from the ARIMA model. This helps us assess the model’s performance in capturing the dynamics of the data.
model_auto %>%
  augment() %>%
  ggplot(aes(date)) +
  geom_line(aes(y = lreturn, color = "Data")) +
  geom_line(aes(y = .fitted, color = "Fitted")) +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00")) +
  guides(colour = guide_legend(title = NULL))
Forecast ARIMA Model
Forecast future values using the fitted ARIMA model. Forecasting is a key aspect of ARIMA models, and this step illustrates the model’s ability to predict future log returns.
model_auto %>%
  forecast(h = 4) %>%
  autoplot(sp500_returns %>% tail(52))
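If numeric prediction intervals are needed rather than a plot, fabletools' hilo() converts the forecast distribution into interval columns; a sketch:
model_auto %>%
  forecast(h = 4) %>%
  hilo(level = 95)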
Additional Useful Functions
Explore additional functions for analyzing the fitted ARIMA models. These functions provide detailed diagnostics, coefficients, and model accuracy metrics, helping to understand model fit and performance.
model_manual %>% coef()
# A tibble: 7 × 6
.model term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 arima ar1 -0.0228 0.217 -0.105 0.917
2 arima ar2 0.00340 0.0222 0.153 0.878
3 arima ar3 -0.0366 0.0199 -1.84 0.0656
4 arima ar4 -0.0146 0.0214 -0.683 0.495
5 arima ar5 -0.0786 0.0200 -3.93 0.0000876
6 arima ma1 -0.0246 0.218 -0.113 0.910
7 arima constant 0.000477 0.000182 2.62 0.00881
model_manual %>% glance()
# A tibble: 1 × 8
.model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
1 arima 0.0000862 8200. -16385. -16385. -16338. <cpl [5]> <cpl [1]>
model_manual %>% report()
Series: lreturn
Model: ARIMA(5,0,1) w/ mean
Coefficients:
ar1 ar2 ar3 ar4 ar5 ma1 constant
-0.0228 0.0034 -0.0366 -0.0146 -0.0786 -0.0246 5e-04
s.e. 0.2172 0.0222 0.0199 0.0214 0.0200 0.2176 2e-04
sigma^2 estimated as 8.619e-05: log likelihood=8200.41
AIC=-16384.82 AICc=-16384.76 BIC=-16338.19
model_manual %>% fitted()
# A tsibble: 2,514 x 3 [1]
# Key: .model [1]
.model date .fitted
<chr> <int> <dbl>
1 arima 2 0.000428
2 arima 3 0.000291
3 arima 4 0.000441
4 arima 5 0.000162
5 arima 6 0.000287
6 arima 7 0.0000122
7 arima 8 0.000723
8 arima 9 -0.000350
9 arima 10 0.000475
10 arima 11 0.000709
# ℹ 2,504 more rows
model_manual %>% residuals()
# A tsibble: 2,514 x 3 [1]
# Key: .model [1]
.model date .resid
<chr> <int> <dbl>
1 arima 2 0.00268
2 arima 3 0.000255
3 arima 4 0.00355
4 arima 5 0.00272
5 arima 6 0.00146
6 arima 7 -0.00944
7 arima 8 0.00757
8 arima 9 0.00277
9 arima 10 -0.0114
10 arima 11 0.0117
# ℹ 2,504 more rows
model_manual %>% accuracy()
# A tibble: 1 × 10
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 arima Training 0.00000121 0.00927 0.00638 Inf Inf 0.673 0.688 -0.0000307
Preview of Volatility Processes
In time series analysis, understanding and visualizing volatility—fluctuations in a time series’ variability—is crucial, especially in financial contexts. The code snippet below provides a preview of volatility in S&P 500 returns by focusing on the squared log returns. By squaring the log returns (lreturn^2), we create a series (lreturn_sq) that emphasizes periods of high variability, since squaring amplifies larger values (indicative of more volatile periods) while diminishing smaller fluctuations.
sp500_returns %>%
  mutate(lreturn_sq = lreturn^2) %>%
  autoplot(.vars = lreturn_sq)
This squared log-return series is then visualized using autoplot, offering a clear depiction of volatility clustering: periods where high or low variability tends to persist. Observing these clusters is valuable, as it reveals underlying patterns of market behavior that can inform the selection and calibration of models like ARIMA or other volatility-focused models.
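One way to quantify the clustering visible in the plot (a sketch, not from the original) is a Ljung-Box test on the squared returns: significant autocorrelation there, despite nearly uncorrelated returns, is the classic signature of ARCH-type effects.
sp500_returns %>%
  mutate(lreturn_sq = lreturn^2) %>%
  features(lreturn_sq, ljung_box, lag = 12)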
Exercises
In this exercise, you will build an ARIMA model for a stock of your choice and forecast its returns.
- Using the download_data function, download the daily price series for a stock of your choice.
- Visualise the daily log returns for the stock, together with the ACF and PACF.
- Automatically fit an ARIMA model for the daily log return series and investigate its properties.
- Plot the fitted return series together with the original data.
- Forecast the next five days of daily log returns.
Bonus: Using the stock’s data, can you spot periods of high volatility?