Forecasting with R

Introduction

Time series forecasting is a crucial tool in economic analysis. This tutorial demonstrates how to build and evaluate forecasting models for U.S. unemployment rates using R.

Prerequisites

Before starting, ensure you have the following packages installed:

install.packages(c("readxl", "forecast", "ggplot2", "here", 
                  "dplyr", "tseries", "lubridate", "tidyverse",
                  "knitr", "kableExtra"))

These packages provide essential tools for time series analysis, data manipulation, and visualization. The forecast package, developed by Hyndman and Khandakar, offers state-of-the-art forecasting methods [@hyndman2008].
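
Once everything is installed, load the packages at the start of your session. Loading each package explicitly keeps the dependencies clear (tidyverse also attaches dplyr and ggplot2):

library(readxl)      # Excel import: read_excel()
library(here)        # project-relative file paths
library(tidyverse)   # dplyr, ggplot2, and friends
library(lubridate)   # date handling: my(), year(), month()
library(forecast)    # models and tools: auto.arima(), ets(), tbats(), forecast()
library(tseries)     # stationarity tests: adf.test(), kpss.test()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), add_header_above()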

Data Preparation and Initial Analysis

Loading and Examining the Data

First, let’s import our unemployment data and examine its structure:

# Import data
unemployment_data <- read_excel(
  here("databases/unemployment_data.xlsx"), 
  sheet = "Sheet1"
) 

# Clean and transform data
unemployment_data <- unemployment_data %>%
  mutate(
    Date = my(Date),
    Year = year(Date),
    Month = month(Date)
  ) %>%
  arrange(Date)

# Display first few rows
kable(head(unemployment_data), 
      caption = "First six rows of unemployment data") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
First six rows of unemployment data

Date        UnemploymentRate  us_m_lr  us_w_lr  lr_t25  Year  Month
2000-01-01  4.0               3.3      3.7      3.0     2000  1
2000-02-01  4.1               3.5      3.6      3.0     2000  2
2000-03-01  4.0               3.2      3.7      3.0     2000  3
2000-04-01  3.8               3.1      3.5      2.9     2000  4
2000-05-01  4.0               3.3      3.7      3.0     2000  5
2000-06-01  4.0               3.2      3.7      3.0     2000  6

The data preparation step converts the raw month-year strings into proper Date objects and orders the rows chronologically, which is essential for time series analysis.
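
For instance, lubridate's my() parses strings in month-year order; assuming the raw Date column held values like "Jan 2000" (an assumption, since the source file isn't shown), the conversion behaves as follows:

# my() parses month-year strings into Date objects
my("Jan 2000")   # "2000-01-01"
my("01-2000")    # "2000-01-01"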

Let’s examine the data structure:

glimpse(unemployment_data)
Rows: 297
Columns: 7
$ Date             <date> 2000-01-01, 2000-02-01, 2000-03-01, 2000-04-01, 2000…
$ UnemploymentRate <dbl> 4.0, 4.1, 4.0, 3.8, 4.0, 4.0, 4.0, 4.1, 3.9, 3.9, 3.9…
$ us_m_lr          <dbl> 3.3, 3.5, 3.2, 3.1, 3.3, 3.2, 3.3, 3.3, 3.3, 3.3, 3.4…
$ us_w_lr          <dbl> 3.7, 3.6, 3.7, 3.5, 3.7, 3.7, 3.7, 3.8, 3.5, 3.3, 3.4…
$ lr_t25           <dbl> 3.0, 3.0, 3.0, 2.9, 3.0, 3.0, 3.0, 3.1, 3.0, 2.9, 3.0…
$ Year             <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,…
$ Month            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5,…

Creating a Time Series Object

Now we’ll convert our data into a proper time series object:

# Convert to time series
unemployment_ts <- ts(
  unemployment_data$UnemploymentRate,
  start = c(min(unemployment_data$Year), 
            min(unemployment_data$Month)),
  frequency = 12
)
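
# Quick sanity check on the ts object: with 297 monthly observations
# starting January 2000, the series should run through September 2024
frequency(unemployment_ts)  # 12
start(unemployment_ts)      # 2000 1
end(unemployment_ts)        # 2024 9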

# Create initial visualization
p1 <- ggplot(unemployment_data, aes(x = Date, y = UnemploymentRate)) +
  geom_line(color = "#2c3e50", linewidth = 1) +
  labs(
    title = "U.S. Unemployment Rate Time Series",
    x = "Year",
    y = "Unemployment Rate (%)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

p1

[Figure: U.S. Unemployment Rate Time Series]
Interpreting the Time Series Plot

The plot above shows several key features (a season plot sketch follows this list):

  1. A long-run level driven by the business cycle rather than a steady trend
  2. Mild seasonal fluctuations within each year
  3. Sharp spikes during the 2008-2009 recession and the 2020 pandemic
  4. A return to historically low rates in the most recent years
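
To inspect the seasonal pattern more directly, a season plot overlays each year of data; ggseasonplot() comes with the forecast package:

# Overlay each year to compare within-year (seasonal) movements
ggseasonplot(unemployment_ts) +
  labs(title = "Seasonal Plot: U.S. Unemployment Rate",
       y = "Unemployment Rate (%)") +
  theme_minimal()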

Time Series Analysis

Seasonal Decomposition

Let’s decompose our series to understand its components [@box2015time]:

# Decompose the series
decomposition <- stl(unemployment_ts, s.window = "periodic")

# Plot decomposition
autoplot(decomposition) +
  labs(title = "Seasonal Decomposition of Unemployment Rate") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    strip.background = element_blank()
  )

[Figure: Seasonal Decomposition of Unemployment Rate]
Key Components of Decomposition

The decomposition reveals (see the extraction sketch after this list):

  • Trend: Long-term movement in the series
  • Seasonal: Regular patterns that repeat with fixed period
  • Remainder: Random fluctuations after removing trend and seasonal components
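
If you want the numbers behind the plot, the fitted components live in the stl object's time.series slot. As a sketch, we can also compute the seasonal-strength measure of Hyndman and Athanasopoulos:

# Extract the fitted components (columns: seasonal, trend, remainder)
components <- decomposition$time.series
head(components)

# Seasonal strength: 1 - Var(remainder) / Var(seasonal + remainder);
# values near 0 indicate weak seasonality, values near 1 strong seasonality
seasonal_strength <- max(0, 1 - var(components[, "remainder"]) /
                              var(components[, "seasonal"] + components[, "remainder"]))
seasonal_strength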

Stationarity Testing

We’ll create a comprehensive function for stationarity testing:

# Function to conduct and interpret stationarity tests
check_stationarity <- function(ts_data, alpha = 0.05) {
  # ADF test: null hypothesis is a unit root (non-stationarity),
  # so a small p-value indicates stationarity
  adf_result <- adf.test(ts_data)
  
  # KPSS test: null hypothesis is stationarity,
  # so a small p-value indicates non-stationarity
  kpss_result <- kpss.test(ts_data)
  
  # Return results in a formatted table
  results_df <- data.frame(
    Test = c("ADF Test", "KPSS Test"),
    Statistic = c(adf_result$statistic, kpss_result$statistic),
    P_Value = c(adf_result$p.value, kpss_result$p.value),
    Interpretation = c(
      ifelse(adf_result$p.value < alpha, "Stationary", "Non-stationary"),
      ifelse(kpss_result$p.value > alpha, "Stationary", "Non-stationary")
    )
  )
  
  return(results_df)
}

# Check original series
stationarity_results <- check_stationarity(unemployment_ts)

# Display results in a formatted table
kable(stationarity_results, 
      caption = "Stationarity Test Results") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Stationarity Test Results

                Test        Statistic   P_Value    Interpretation
Dickey-Fuller   ADF Test   -2.3825898   0.4150041  Non-stationary
KPSS Level      KPSS Test   0.6985689   0.0136756  Non-stationary
Interpreting Stationarity Tests

Both tests agree here: the ADF test fails to reject its unit-root null (p = 0.415 > 0.05), and the KPSS test rejects its stationarity null (p = 0.014 < 0.05), so the series is non-stationary and we need to apply differencing. Let’s check the differenced series:

# Apply differencing
diff_unemployment_ts <- diff(unemployment_ts)

# Check differenced series
diff_results <- check_stationarity(diff_unemployment_ts)

# Display results
kable(diff_results, 
      caption = "Stationarity Test Results (Differenced Series)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Stationarity Test Results (Differenced Series)

                Test        Statistic   P_Value  Interpretation
Dickey-Fuller   ADF Test   -7.6542460   0.01     Stationary
KPSS Level      KPSS Test   0.0658335   0.10     Stationary

(The reported p-values of 0.01 and 0.10 are the bounds of the lookup tables used by tseries: the true ADF p-value is smaller than 0.01 and the true KPSS p-value is larger than 0.10.)
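
As a cross-check, the forecast package can estimate the required amount of differencing directly. Given the tests above we would expect one regular difference, but treat this as a sketch:

# Estimate how many differences are needed for stationarity
ndiffs(unemployment_ts)   # regular differences (KPSS-based by default)
nsdiffs(unemployment_ts)  # seasonal differences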

Model Building

Data Splitting

We’ll split our data into training and validation sets:

# Split data into training and validation sets
train_end <- length(unemployment_ts) - 12
train_ts <- window(unemployment_ts, end = c(time(unemployment_ts)[train_end]))
test_ts <- window(unemployment_ts, start = c(time(unemployment_ts)[train_end + 1]))

# Display information about the split
# Note: start() and end() return c(year, period), so we collapse each
# into a single "year-month" string; passing them to format() directly
# would silently recycle the two elements into extra data.frame rows
split_info <- data.frame(
  Dataset = c("Training", "Testing"),
  Start_Date = c(paste(start(train_ts), collapse = "-"),
                 paste(start(test_ts), collapse = "-")),
  End_Date = c(paste(end(train_ts), collapse = "-"),
               paste(end(test_ts), collapse = "-")),
  N_Observations = c(length(train_ts), length(test_ts))
)

kable(split_info, 
      caption = "Dataset Split Information") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Dataset Split Information

Dataset   Start_Date  End_Date  N_Observations
Training  2000-1      2023-9    285
Testing   2023-10     2024-9    12
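
For reference, forecast's subset() method for ts objects expresses the same split a little more directly; a sketch assuming the same 12-month holdout (the _alt names are just for illustration):

# Equivalent split using index-based subsetting from the forecast package
train_ts_alt <- subset(unemployment_ts, end = length(unemployment_ts) - 12)
test_ts_alt  <- subset(unemployment_ts, start = length(unemployment_ts) - 11)
length(test_ts_alt)  # 12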

Model Fitting and Selection

We’ll fit multiple models and compare their performance:

# Function to fit and evaluate multiple models
fit_models <- function(train_data) {
  # Fit different models
  models <- list(
    auto_arima = auto.arima(train_data, seasonal = TRUE),
    ets = ets(train_data),
    tbats = tbats(train_data)
  )
  
  # Generate forecasts
  forecasts <- lapply(models, forecast, h = 12)
  
  # Calculate accuracy metrics
  accuracy_metrics <- lapply(forecasts, accuracy)
  
  # Combine results
  results <- list(
    models = models,
    forecasts = forecasts,
    accuracy = accuracy_metrics
  )
  
  return(results)
}

# Fit models
model_results <- fit_models(train_ts)

# Display model summaries
for (model_name in names(model_results$models)) {
  cat("\n\n### ", toupper(model_name), " Model Summary\n")
  print(summary(model_results$models[[model_name]]))
}


###  AUTO_ARIMA  Model Summary
Series: train_data 
ARIMA(0,1,0) 

sigma^2 = 0.4546:  log likelihood = -291.05
AIC=584.1   AICc=584.11   BIC=587.75

Training set error measures:
                        ME      RMSE       MAE       MPE     MAPE      MASE
Training set -0.0006877193 0.6730919 0.1859789 -0.269922 2.802555 0.1674547
                   ACF1
Training set 0.02106772


###  ETS  Model Summary
ETS(M,A,N) 

Call:
 ets(y = train_data) 

  Smoothing parameters:
    alpha = 0.9996 
    beta  = 0.9996 

  Initial states:
    l = 3.8204 
    b = 0.1452 

  sigma:  0.1205

     AIC     AICc      BIC 
1380.192 1380.407 1398.455 

Training set error measures:
                       ME      RMSE       MAE       MPE     MAPE      MASE
Training set -0.000508763 0.9415695 0.2547617 0.0447237 3.857362 0.2293863
                   ACF1
Training set -0.4278826


###  TBATS  Model Summary
                  Length Class  Mode     
lambda              1    -none- numeric  
alpha               1    -none- numeric  
beta                0    -none- NULL     
damping.parameter   0    -none- NULL     
gamma.values        0    -none- NULL     
ar.coefficients     0    -none- NULL     
ma.coefficients     0    -none- NULL     
likelihood          1    -none- numeric  
optim.return.code   1    -none- numeric  
variance            1    -none- numeric  
AIC                 1    -none- numeric  
parameters          2    -none- list     
seed.states         1    -none- numeric  
fitted.values     285    ts     numeric  
errors            285    ts     numeric  
x                 285    -none- numeric  
seasonal.periods    0    -none- NULL     
y                 285    ts     numeric  
call                2    -none- call     
series              1    -none- character
method              1    -none- character
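
The TBATS listing above is just R's default summary of the model's internal list; tbats objects have no bespoke summary() method, so print() gives the more readable description of the fitted specification:

# print() shows the fitted TBATS specification directly
print(model_results$models$tbats)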

Model Comparison

Let’s compare the in-sample (training) performance of our models; an out-of-sample comparison against the holdout follows the table:

# Function to compare model performance
compare_models <- function(model_results) {
  # Get metrics for each model
  metrics <- sapply(model_results$forecasts, function(f) {
    acc <- accuracy(f)
    acc[1, ]  # single row: training-set accuracy metrics
  })
  
  # Convert to data frame
  comparison <- as.data.frame(t(metrics))
  rownames(comparison) <- c("ARIMA", "ETS", "TBATS")
  
  return(comparison)
}

# Generate comparison
model_comparison <- compare_models(model_results)

# Select best model based on training-set RMSE
best_model <- model_results$models[[
  which.min(sapply(model_results$forecasts, 
                   function(f) accuracy(f)[1, "RMSE"]))
]]

# Display formatted comparison table
kable(model_comparison, 
      caption = "Model Performance Comparison",
      digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  add_header_above(c(" " = 1, "Accuracy Metrics" = ncol(model_comparison)))
Model Performance Comparison

                      Accuracy Metrics
           ME   RMSE    MAE     MPE   MAPE   MASE    ACF1
ARIMA  -0.001  0.673  0.186  -0.270  2.803  0.167   0.021
ETS    -0.001  0.942  0.255   0.045  3.857  0.229  -0.428
TBATS  -0.006  0.700  0.195  -0.290  2.924  0.175  -0.183
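
The table above scores each model on its own training data. Since we held out 12 months in test_ts, we can also score the forecasts out of sample, which is the fairer basis for model selection; a minimal sketch:

# Evaluate each 12-month forecast against the held-out test set
test_accuracy <- t(sapply(model_results$forecasts, function(f) {
  accuracy(f, test_ts)["Test set", c("RMSE", "MAE", "MAPE")]
}))

kable(test_accuracy,
      caption = "Out-of-sample Accuracy (12-Month Holdout)",
      digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))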

Final Forecast and Visualization

Let’s generate and visualize forecasts using our best-performing model:

# Generate final forecast
final_forecast <- forecast(best_model, h = 12)

# Create visualization
autoplot(final_forecast) +
  labs(
    title = "12-Month Unemployment Rate Forecast",
    subtitle = paste("Using", class(best_model)[1], "Model"),
    x = "Year",
    y = "Unemployment Rate (%)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    legend.position = "bottom"
  )

[Figure: 12-Month Unemployment Rate Forecast]

Forecast Intervals and Summary

Let’s examine the detailed forecast results:

# Create forecast summary table
# Note: time() returns decimal years (e.g., 2023.75 for October 2023),
# so we convert via lubridate's date_decimal(); passing the raw number
# to as.Date() would misread it as days since 1970
forecast_table <- data.frame(
  Month = format(seq(as.Date(date_decimal(as.numeric(time(final_forecast$mean)[1]))),
                     by = "month", length.out = 12), "%Y-%m"),
  Point_Forecast = final_forecast$mean,
  Lower_80 = final_forecast$lower[, 1],
  Upper_80 = final_forecast$upper[, 1],
  Lower_95 = final_forecast$lower[, 2],
  Upper_95 = final_forecast$upper[, 2]
)

# Display formatted table
kable(forecast_table, 
      caption = "Detailed Forecast Results",
      digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  add_header_above(c(" " = 2, "80% Interval" = 2, "95% Interval" = 2))
Detailed Forecast Results

                          80% Interval          95% Interval
Month    Point_Forecast   Lower_80   Upper_80   Lower_95      Upper_95
2023-10  3.8              2.9358807  4.664119    2.47844359   5.121556
2023-11  3.8              2.5779508  5.022049    1.93103700   5.668963
2023-12  3.8              2.3033015  5.296699    1.51099715   6.089003
2024-01  3.8              2.0717614  5.528239    1.15688717   6.443113
2024-02  3.8              1.8677705  5.732229    0.84491002   6.755090
2024-03  3.8              1.6833487  5.916651    0.56286112   7.037139
2024-04  3.8              1.5137553  6.086245    0.30349039   7.296510
2024-05  3.8              1.3559016  6.244098    0.06207399   7.537926
2024-06  3.8              1.2076421  6.392358   -0.16466924   7.764669
2024-07  3.8              1.0674149  6.532585   -0.37912832   7.979128
2024-08  3.8              0.9340405  6.665959   -0.58310676   8.183107
2024-09  3.8              0.8066030  6.793397   -0.77800571   8.378006
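
One caveat: the later 95% lower bounds dip below zero, which is impossible for an unemployment rate. A common remedy is to model on the log scale and let forecast() back-transform; a minimal sketch, assuming we refit with a Box-Cox lambda of 0:

# Sketch: refit on the log scale (lambda = 0) so that back-transformed
# point forecasts and prediction intervals stay strictly positive
fit_log <- auto.arima(train_ts, lambda = 0, seasonal = TRUE)
final_forecast_log <- forecast(fit_log, h = 12)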

Model Diagnostics

Let’s perform comprehensive diagnostics on our chosen model:

# Perform residual diagnostics and capture the plot
residuals_plot <- ggplot2::autoplot(best_model$residuals) +
  theme_minimal() +
  labs(title = "Model Residual Diagnostics") +
  theme(plot.title = element_text(hjust = 0.5))

# Display the plot
print(residuals_plot)

[Figure: Model Residual Diagnostics]
# Perform Ljung-Box test separately (Box.test defaults to lag = 1;
# a stricter check would use a larger lag, e.g. lag = 24, with fitdf
# set to the number of estimated model parameters)
lb_test <- Box.test(residuals(best_model), type = "Ljung-Box")

# Create diagnostic summary table
diagnostic_summary <- data.frame(
  Metric = c("Ljung-Box Test Statistic", 
             "P-value", 
             "Residual Standard Error"),
  Value = c(lb_test$statistic,
            lb_test$p.value,
            sd(residuals(best_model)))
)

# Display diagnostic summary
kable(diagnostic_summary, 
      caption = "Model Diagnostic Summary",
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Diagnostic Summary

Metric                     Value
Ljung-Box Test Statistic  0.1278
P-value                   0.7207
Residual Standard Error   0.6743

Interpreting Diagnostics

The diagnostic plots and tests help us assess model adequacy (a one-call shortcut follows this list):

  1. Residual Plot: Checks for patterns in residuals
  2. ACF Plot: Examines autocorrelation in residuals
  3. Histogram: Assesses normality of residuals
  4. Ljung-Box Test: Formal test for residual autocorrelation
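
As the one-call shortcut mentioned above, forecast's checkresiduals() bundles these diagnostics together: it draws the residual time plot, ACF, and histogram, and runs a Ljung-Box test with a sensible default lag:

# All of the diagnostics above in one call
checkresiduals(best_model)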

Conclusions

This analysis demonstrates a comprehensive approach to time series forecasting using R. Key findings include:

  1. The unemployment rate shows clear seasonal patterns and trends
  2. The ARIMA(0,1,0) model (a random walk without drift) provided the best forecasting performance on the training data
  3. Diagnostic tests suggest the model adequately captures the data patterns
  4. Forecast intervals provide a measure of prediction uncertainty

Next Steps

Consider these potential improvements:

  • Incorporate additional predictor variables
  • Experiment with alternative modeling approaches
  • Implement rolling-window validation
  • Explore ensemble methods