Day 1 Part 4: ARIMA Modelling

Author

Dr Christian Engels

Last updated

November 16, 2024

Load Libraries

Load the libraries needed for data manipulation, financial data access, and time-series analysis. These will help us handle time series data, conduct ARIMA modeling, and visualize the results effectively.

library(tidyverse)
library(tidyfinance)
library(tsibble)
library(fable)
library(feasts)
library(scales)
library(fabletools)

ARIMA Processes

ARIMA (AutoRegressive Integrated Moving Average) processes are commonly used in time series analysis to model and predict future values by combining autoregressive (AR), moving average (MA), and differencing components. The AR part captures the influence of past values, the MA part captures the influence of past forecast errors, and differencing ensures the series is stationary. ARIMA models are particularly useful for understanding trends and cyclic behaviors in time series data.

An autoregressive model of order p, denoted as AR(p), is written as:

y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t,

where \varepsilon_t is white noise. This structure resembles a multiple regression with lagged values of y_t as predictors.

In contrast, a moving average model of order q, denoted MA(q), uses past forecast errors as predictors:

y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q},

where \varepsilon_t is white noise. Since we do not directly observe the values of \varepsilon_t, this is not a regression in the usual sense: the error terms must be inferred as part of the estimation.

When differencing is combined with AR and MA components, we get a non-seasonal ARIMA model, where “integration” refers to the differencing process. The full model is written as:

y'_t = c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t,

where y'_t is the differenced series. This is an ARIMA(p, d, q) model, where:

  • p is the order of the autoregressive part,
  • d is the degree of differencing required,
  • q is the order of the moving average part.

The ARIMA model framework is powerful for capturing both trend and cyclical behaviors in time series data, making it a fundamental tool in forecasting.
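
Before constructing these processes by hand in the next section, note that base R can simulate them directly with stats::arima.sim(). The snippet below is a minimal sketch, with parameter values chosen purely for illustration:

# Simulate 100 observations of an ARIMA(1,1,1) process with
# phi_1 = 0.5 and theta_1 = 0.3 (illustrative values)
sim <- arima.sim(
  model = list(order = c(1, 1, 1), ar = 0.5, ma = 0.3),
  n = 100
)
plot(sim)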

Simulate ARIMA Processes

Here we simulate ARIMA processes to illustrate their characteristics.

set.seed(42)

n <- 10^2
innovations <- 
  tibble(
    time = 1:n, 
    e = rnorm(n),                  # white-noise innovations
    e_lag = lag(e, default = 0)    # innovations lagged by one period
  ) %>% 
  as_tsibble(index = time)

innovations
# A tsibble: 100 x 3 [1]
    time       e   e_lag
   <int>   <dbl>   <dbl>
 1     1  1.37    0     
 2     2 -0.565   1.37  
 3     3  0.363  -0.565 
 4     4  0.633   0.363 
 5     5  0.404   0.633 
 6     6 -0.106   0.404 
 7     7  1.51   -0.106 
 8     8 -0.0947  1.51  
 9     9  2.02   -0.0947
10    10 -0.0627  2.02  
# ℹ 90 more rows

Plot Innovations

Visualize the random innovations generated for the ARIMA processes. This helps us understand the underlying noise component that drives the stochastic processes in ARIMA models.

innovations %>% autoplot(e)

innovations %>% 
  ggplot(aes(e)) + 
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, color = "red")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Generate ARIMA Processes

Generate different ARIMA processes using the simulated innovations. We create a random walk along with ARMA(1,0), ARMA(0,1), and ARMA(1,1) processes to illustrate how varying the parameters affects the behavior of the time series.

arma_processes <- 
  innovations %>% 
  mutate(
    # ARIMA(0,1,0): y_t = y_{t-1} + e_t
    random_walk = accumulate(
      e, 
      (\(y_lag, e) y_lag + e)
    ),
    # AR(1) with phi_1 = 0.9: y_t = 0.9 y_{t-1} + e_t
    arma10 = accumulate(
      e, 
      (\(y_lag, e) 0.9*y_lag + e)
    ),
    # MA(1) with theta_1 = -0.9: y_t = e_t - 0.9 e_{t-1}
    arma01 = -0.9*e_lag + e,
    # ARMA(1,1): y_t = 0.9 y_{t-1} + 0.9 e_{t-1} + e_t
    arma11 = accumulate2(
      e, 
      e_lag,
      .f = (\(y_lag, e, e_lag) 0.9 * y_lag + 0.9 * e_lag + e),
      .init = 0
    )[-1]  # drop the starting value supplied via .init
  )

arma_processes
# A tsibble: 100 x 7 [1]
    time       e   e_lag random_walk arma10 arma01 arma11
   <int>   <dbl>   <dbl>       <dbl>  <dbl>  <dbl>  <dbl>
 1     1  1.37    0            1.37   1.37   1.37    1.37
 2     2 -0.565   1.37         0.806  0.669 -1.80    1.90
 3     3  0.363  -0.565        1.17   0.965  0.871   1.57
 4     4  0.633   0.363        1.80   1.50   0.306   2.37
 5     5  0.404   0.633        2.21   1.76  -0.165   3.11
 6     6 -0.106   0.404        2.10   1.47  -0.470   3.05
 7     7  1.51   -0.106        3.61   2.84   1.61    4.16
 8     8 -0.0947  1.51         3.52   2.46  -1.46    5.01
 9     9  2.02   -0.0947       5.54   4.23   2.10    6.45
10    10 -0.0627  2.02         5.47   3.75  -1.88    7.56
# ℹ 90 more rows

Display ARIMA Components

Visualize the generated ARIMA components using their autocorrelation functions (ACF) and partial autocorrelation functions (PACF). This helps us understand the persistence, memory, and dependence structures within each process.

In time series analysis, autocorrelation quantifies the relationship between values of a series at different time lags, offering insights into how past values influence future values within the same series. Unlike simple correlation, which measures the linear relationship between two distinct variables, autocorrelation assesses dependencies within a time series by examining lagged values.

Each lag in a time series has an associated autocorrelation coefficient, with r_1 representing the correlation between y_t and y_{t-1}, r_2 capturing the relationship between y_t and y_{t-2}, and so forth. The formula to calculate the autocorrelation at lag k is:

r_k = \frac{\sum_{t=k+1}^T (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^T (y_t - \bar{y})^2},

where T denotes the length of the time series, and \bar{y} is the mean of the series. The set of autocorrelation coefficients for various lags forms the autocorrelation function (ACF), which helps identify patterns and dependencies in the data.
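
To make the formula concrete, the sketch below computes r_1 directly from the definition and compares it with the value reported by ACF() from feasts, using the simulated innovations (the intermediate variable names are local choices):

# Compute r_1 by applying the formula to the innovations series
y <- innovations$e
y_bar <- mean(y)
r_1 <- sum((y[-1] - y_bar) * (y[-length(y)] - y_bar)) / sum((y - y_bar)^2)
r_1

# Compare with the sample ACF at lag 1
innovations %>% ACF(e, lag_max = 1)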

However, an ACF plot can sometimes be misleading. For example, if y_t and y_{t-1} are correlated, then y_t and y_{t-2} may also appear correlated—not due to new information in y_{t-2}, but simply because both are related to y_{t-1}. This can create a chain of dependencies that may not accurately reflect the unique influence of each lag.

To address this issue, we use partial autocorrelations, which measure the relationship between y_t and y_{t-k} after controlling for the influence of intermediate lags (1 through k-1). The first partial autocorrelation is the same as the first autocorrelation, as there are no intervening lags to account for. Each subsequent partial autocorrelation isolates the direct effect of that specific lag, removing the influence of shorter lags. In an autoregressive model, each partial autocorrelation can be estimated as the final coefficient in an AR model of that order. Specifically, the kth partial autocorrelation, \alpha_k, corresponds to the coefficient \phi_k in an AR(k) model.

This refined approach gives us the partial autocorrelation function (PACF), which provides a clearer picture of the direct relationships between values at specific lags, making it a valuable tool for model selection and interpretation in time series analysis.
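
The AR(k) interpretation can be checked directly on the arma10 series simulated above: fit an AR(2) by ordinary least squares and compare its final coefficient with the sample PACF at lag 2. This is a rough sketch; the two estimators differ slightly, so the match is only approximate in finite samples.

# phi_2 from an AR(2) regression should be close to alpha_2
x <- arma_processes$arma10
T_len <- length(x)
ar2_fit <- lm(x[3:T_len] ~ x[2:(T_len - 1)] + x[1:(T_len - 2)])
coef(ar2_fit)[3]

# Compare with the sample PACF at lag 2
pacf(x, lag.max = 2, plot = FALSE)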

arma_processes %>% 
  gg_tsdisplay(
    arma10, 
    plot_type = "partial", 
    lag_max = 12
  )

arma_processes %>%
  gg_tsdisplay(
    arma01, 
    plot_type = "partial", 
    lag_max = 12
  )

arma_processes %>% 
  gg_tsdisplay(
    arma11, 
    plot_type = "partial", 
    lag_max = 12
  )

For data following an ARIMA model with either AR terms (ARIMA(p, d, 0)) or MA terms (ARIMA(0, d, q)), the ACF and PACF plots can assist in identifying the appropriate values of p or q. When both parameters are positive, however, the plots may not distinctly guide the selection of these values.

To identify an ARIMA(p, d, 0) model, check the ACF and PACF plots of the differenced series for these patterns:

  • The ACF shows an exponentially decaying or sinusoidal trend.
  • The PACF displays a significant spike at lag p, with minimal activity beyond this point.

In contrast, data may fit an ARIMA(0, d, q) model if the ACF and PACF of the differenced series reveal:

  • An exponentially decaying or sinusoidal pattern in the PACF.
  • A distinct spike at lag q in the ACF, with no significant lags thereafter.

These patterns help differentiate between autoregressive and moving average components in the time series, making model identification clearer.
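
These rules are easiest to verify on a clean example. The following sketch simulates a pure AR(2) series with arima.sim() (coefficients chosen purely for illustration); its ACF should decay gradually while its PACF cuts off after lag 2:

# Pure AR(2) with phi_1 = 0.5 and phi_2 = 0.3
set.seed(1)
arima.sim(model = list(ar = c(0.5, 0.3)), n = 500) %>% 
  as_tsibble() %>%   # convert the ts object to a tsibble (column "value")
  gg_tsdisplay(value, plot_type = "partial", lag_max = 12)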

SP500 Daily Returns

Download and analyze SP500 daily returns. We use the S&P 500 index to study real-world time series data, which is useful for understanding the practical application of ARIMA modeling.

sp500_returns <- 
  download_data(
    "stock_prices", 
    symbols = "^GSPC", 
    start = "2010-01-01", 
    end = "2019-12-31"
  ) %>% 
  rename(price = adjusted_close) %>% 
  select(symbol, date, price) %>% 
  mutate(
    date = row_number(),          # regular integer index (no calendar gaps)
    lprice = log(price),          # log prices
    lreturn = difference(lprice)  # daily log returns
  ) %>% 
  remove_missing() %>% 
  as_tsibble(index = date) %>% 
  glimpse()
Warning: Removed 1 row containing missing values or values outside the scale
range.
Rows: 2,514
Columns: 5
$ symbol  <chr> "^GSPC", "^GSPC", "^GSPC", "^GSPC", "^GSPC", "^GSPC", "^GSPC",…
$ date    <int> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19…
$ price   <dbl> 1136.52, 1137.14, 1141.69, 1144.98, 1146.98, 1136.22, 1145.68,…
$ lprice  <dbl> 7.035726, 7.036272, 7.040265, 7.043142, 7.044888, 7.035462, 7.…
$ lreturn <dbl> 0.0031108320, 0.0005453719, 0.0039932184, 0.0028775831, 0.0017…

Explore Log Prices

Explore and visualize the log-transformed prices of the S&P 500. The log transformation helps stabilize variance and allows us to focus on percentage changes.

sp500_returns %>% 
  gg_tsdisplay(
    lprice, 
    plot_type = "partial", 
    lag_max = 12
  )

sp500_returns %>% 
  autoplot(.vars = lprice)

Perform Stationarity Tests

Conduct stationarity tests on the log prices to determine if differencing is required. Stationarity is crucial in ARIMA modeling as non-stationary data can lead to misleading results.

sp500_returns %>% 
  features(lprice, unitroot_kpss)
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1      27.2        0.01
sp500_returns %>% 
  features(lprice, unitroot_ndiffs)
# A tibble: 1 × 1
  ndiffs
   <int>
1      1
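
As a quick cross-check (a sketch; output not shown), applying the same KPSS test to the log returns, i.e. the once-differenced log prices, should no longer reject stationarity, consistent with ndiffs = 1:

# KPSS test on the differenced series (log returns)
sp500_returns %>% 
  features(lreturn, unitroot_kpss)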

Explore Log Returns

Visualize the log returns of the S&P 500 and test whether any serial correlation remains. The Ljung-Box test below rejects the null of no autocorrelation up to lag 12 (p ≈ 0.003), so there is some linear dependence left for an ARMA model to capture.

sp500_returns %>% 
  autoplot(.vars = lreturn)

sp500_returns %>% 
  features(lreturn, ljung_box, lag = 12)
# A tibble: 1 × 2
  lb_stat lb_pvalue
    <dbl>     <dbl>
1    29.8   0.00300
sp500_returns %>% 
  gg_tsdisplay(lreturn, plot_type = "partial", lag_max = 12)

Fit ARIMA Models

Fit manual and automatic ARIMA models to the S&P 500 log returns. This step demonstrates how to specify ARIMA models manually and automatically using R’s modeling capabilities.

Manual ARIMA Model

Fit a manual ARIMA model with specified parameters. This approach allows us to control the AR and MA components directly, providing insights into model specification and parameter estimation.

model_manual <- 
  sp500_returns %>% 
  remove_missing() %>% 
  model(arima = ARIMA(lreturn ~ pdq(5,0,1)))
model_manual %>% report()
Series: lreturn 
Model: ARIMA(5,0,1) w/ mean 

Coefficients:
          ar1     ar2      ar3      ar4      ar5      ma1  constant
      -0.0228  0.0034  -0.0366  -0.0146  -0.0786  -0.0246     5e-04
s.e.   0.2172  0.0222   0.0199   0.0214   0.0200   0.2176     2e-04

sigma^2 estimated as 8.619e-05:  log likelihood=8200.41
AIC=-16384.82   AICc=-16384.76   BIC=-16338.19
model_manual %>% gg_tsresiduals(lag_max = 12)
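
Beyond the residual plots, a formal check is helpful. The sketch below applies a Ljung-Box test to the innovation residuals, setting the degrees-of-freedom adjustment to the number of estimated ARMA coefficients (p + q = 6 here):

# Ljung-Box test on the ARIMA(5,0,1) innovations; dof = p + q = 6
model_manual %>% 
  augment() %>% 
  features(.innov, ljung_box, lag = 12, dof = 6)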

Automatic ARIMA Model

Fit an automatic ARIMA model using R’s built-in selection algorithms. This approach finds the best ARIMA model by optimizing information criteria, making it useful for practical applications.

model_auto <- 
  sp500_returns %>% 
  remove_missing() %>%
  model(arima = ARIMA(lreturn))
model_auto %>% report() %>% coef()
Series: lreturn 
Model: ARIMA(2,0,2) w/ mean 

Coefficients:
         ar1     ar2      ma1      ma2  constant
      0.1606  0.7838  -0.2092  -0.7678         0
s.e.  0.1222  0.1156   0.1301   0.1266         0

sigma^2 estimated as 8.626e-05:  log likelihood=8198.37
AIC=-16384.75   AICc=-16384.72   BIC=-16349.77
# A tibble: 5 × 6
  .model term       estimate  std.error statistic  p.value
  <chr>  <chr>         <dbl>      <dbl>     <dbl>    <dbl>
1 arima  ar1       0.161     0.122           1.31 1.89e- 1
2 arima  ar2       0.784     0.116           6.78 1.48e-11
3 arima  ma1      -0.209     0.130          -1.61 1.08e- 1
4 arima  ma2      -0.768     0.127          -6.06 1.53e- 9
5 arima  constant  0.0000231 0.00000447      5.16 2.71e- 7
model_auto %>% gg_tsresiduals(lag_max = 12)
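
By default, ARIMA() selects a model with a stepwise search over the (p, d, q) space. When runtime is not a concern, a wider search can be requested; the sketch below (considerably slower to run) disables the stepwise shortcut and the likelihood approximation:

# More exhaustive model search (slower)
model_search <- 
  sp500_returns %>% 
  remove_missing() %>% 
  model(arima = ARIMA(lreturn, stepwise = FALSE, approximation = FALSE))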

Compare Data and Fitted Values

Compare the actual log returns with the fitted values from the ARIMA model. This helps us assess the model’s performance in capturing the dynamics of the data.

model_auto %>% 
  augment() %>% 
  ggplot(aes(date)) + 
  geom_line(aes(y = lreturn, color = "Data")) + 
  geom_line(aes(y = .fitted, color = "Fitted")) +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00")) +
  guides(colour = guide_legend(title = NULL))

Forecast ARIMA Model

Forecast future values using the fitted ARIMA model. Forecasting is a key aspect of ARIMA models, and this step illustrates the model’s ability to predict future log returns.

model_auto %>% 
  forecast(h = 4) %>% 
  autoplot(sp500_returns %>% tail(52))
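
The forecast object carries full predictive distributions, not just point forecasts. As a sketch, hilo() from fabletools extracts numeric prediction intervals from them:

# 95% prediction intervals for the four-step-ahead forecasts
model_auto %>% 
  forecast(h = 4) %>% 
  hilo(level = 95)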

Additional Useful Functions

Explore additional functions for analyzing the fitted ARIMA models. These functions provide detailed diagnostics, coefficients, and model accuracy metrics, helping to understand model fit and performance.

model_manual %>% coef()
# A tibble: 7 × 6
  .model term      estimate std.error statistic   p.value
  <chr>  <chr>        <dbl>     <dbl>     <dbl>     <dbl>
1 arima  ar1      -0.0228    0.217       -0.105 0.917    
2 arima  ar2       0.00340   0.0222       0.153 0.878    
3 arima  ar3      -0.0366    0.0199      -1.84  0.0656   
4 arima  ar4      -0.0146    0.0214      -0.683 0.495    
5 arima  ar5      -0.0786    0.0200      -3.93  0.0000876
6 arima  ma1      -0.0246    0.218       -0.113 0.910    
7 arima  constant  0.000477  0.000182     2.62  0.00881  
model_manual %>% glance()
# A tibble: 1 × 8
  .model    sigma2 log_lik     AIC    AICc     BIC ar_roots  ma_roots 
  <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <list>    <list>   
1 arima  0.0000862   8200. -16385. -16385. -16338. <cpl [5]> <cpl [1]>
model_manual %>% report()
Series: lreturn 
Model: ARIMA(5,0,1) w/ mean 

Coefficients:
          ar1     ar2      ar3      ar4      ar5      ma1  constant
      -0.0228  0.0034  -0.0366  -0.0146  -0.0786  -0.0246     5e-04
s.e.   0.2172  0.0222   0.0199   0.0214   0.0200   0.2176     2e-04

sigma^2 estimated as 8.619e-05:  log likelihood=8200.41
AIC=-16384.82   AICc=-16384.76   BIC=-16338.19
model_manual %>% fitted()
# A tsibble: 2,514 x 3 [1]
# Key:       .model [1]
   .model  date    .fitted
   <chr>  <int>      <dbl>
 1 arima      2  0.000428 
 2 arima      3  0.000291 
 3 arima      4  0.000441 
 4 arima      5  0.000162 
 5 arima      6  0.000287 
 6 arima      7  0.0000122
 7 arima      8  0.000723 
 8 arima      9 -0.000350 
 9 arima     10  0.000475 
10 arima     11  0.000709 
# ℹ 2,504 more rows
model_manual %>% residuals()
# A tsibble: 2,514 x 3 [1]
# Key:       .model [1]
   .model  date    .resid
   <chr>  <int>     <dbl>
 1 arima      2  0.00268 
 2 arima      3  0.000255
 3 arima      4  0.00355 
 4 arima      5  0.00272 
 5 arima      6  0.00146 
 6 arima      7 -0.00944 
 7 arima      8  0.00757 
 8 arima      9  0.00277 
 9 arima     10 -0.0114  
10 arima     11  0.0117  
# ℹ 2,504 more rows
model_manual %>% accuracy()
# A tibble: 1 × 10
  .model .type            ME    RMSE     MAE   MPE  MAPE  MASE RMSSE       ACF1
  <chr>  <chr>         <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
1 arima  Training 0.00000121 0.00927 0.00638   Inf   Inf 0.673 0.688 -0.0000307

Preview of Volatility Processes

In time series analysis, understanding and visualizing volatility—fluctuations in a time series’ variability—is crucial, especially in financial contexts. The code snippet below provides a preview of volatility in S&P 500 returns by focusing on the squared log returns. By squaring the log returns (lreturn^2), we create a series (lreturn_sq) that emphasizes periods of high variability, since squaring amplifies larger values (indicative of more volatile periods) while diminishing smaller fluctuations.

sp500_returns %>% 
  mutate(lreturn_sq = lreturn^2) %>% 
  autoplot(.vars = lreturn_sq)

This squared log-return series is then visualized using autoplot, offering a clear depiction of volatility clustering: periods where high or low variability tends to persist. Observing these clusters is valuable, as it reveals underlying patterns of market behavior that can inform the selection and calibration of models like ARIMA or other volatility-focused models.
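
A complementary diagnostic, sketched below with the ACF tools used earlier, is to inspect the autocorrelations of the squared returns: pronounced positive autocorrelation here, despite the near-white-noise behavior of the returns themselves, is the classic signature of volatility clustering.

# ACF of squared log returns: significant positive lags indicate
# volatility clustering
sp500_returns %>% 
  mutate(lreturn_sq = lreturn^2) %>% 
  ACF(lreturn_sq, lag_max = 12) %>% 
  autoplot()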

Exercises

In this exercise, you will build an ARIMA model for a stock of your choice and forecast its returns.

  1. Using the download_data function, download the daily price series for a stock of your choice.
  2. Visualize the daily log returns for the stock, together with the ACF and PACF.
  3. Automatically fit an ARIMA model for the daily log return series and investigate its properties.
  4. Plot the fitted return series together with the original data.
  5. Forecast the next five days of daily log returns.

Bonus: Using the stock’s data, can you spot periods of high volatility?