Forecasting with R

Introduction

Time series forecasting is a crucial tool in economic analysis. This tutorial demonstrates how to build and evaluate forecasting models for U.S. unemployment rates using R.

Prerequisites

Before starting, ensure you have the following packages installed:

install.packages(c("readxl", "forecast", "ggplot2", "here", 
                  "dplyr", "tseries", "lubridate", "tidyverse",
                  "knitr", "kableExtra"))

These packages provide essential tools for time series analysis, data manipulation, and visualization. The forecast package, developed by Hyndman and Khandakar, offers state-of-the-art forecasting methods [@hyndman2008].
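
Once everything is installed, load the packages at the start of your session. Loading each package explicitly keeps the dependencies clear (tidyverse also attaches dplyr and ggplot2):

library(readxl)      # Excel import: read_excel()
library(here)        # project-relative file paths
library(tidyverse)   # dplyr, ggplot2, and friends
library(lubridate)   # date handling: my(), year(), month()
library(forecast)    # models and tools: auto.arima(), ets(), tbats(), forecast()
library(tseries)     # stationarity tests: adf.test(), kpss.test()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), add_header_above()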

Data Preparation and Initial Analysis

Loading and Examining the Data

First, let’s import our unemployment data and examine its structure:

# Import data
unemployment_data <- read_excel(
  here("databases/unemployment_data.xlsx"), 
  sheet = "Sheet1"
) 

# Clean and transform data
unemployment_data <- unemployment_data %>%
  mutate(
    Date = my(Date),
    Year = year(Date),
    Month = month(Date)
  ) %>%
  arrange(Date)

# Display first few rows
kable(head(unemployment_data), 
      caption = "First six rows of unemployment data") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
First six rows of unemployment data

Date        UnemploymentRate  us_m_lr  us_w_lr  lr_t25  Year  Month
2000-01-01  4.0               3.3      3.7      3.0     2000  1
2000-02-01  4.1               3.5      3.6      3.0     2000  2
2000-03-01  4.0               3.2      3.7      3.0     2000  3
2000-04-01  3.8               3.1      3.5      2.9     2000  4
2000-05-01  4.0               3.3      3.7      3.0     2000  5
2000-06-01  4.0               3.2      3.7      3.0     2000  6

The data preparation step converts the raw month-year strings into proper Date objects and orders the rows chronologically, which is essential for time series analysis.
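
For instance, lubridate's my() parses strings in month-year order; assuming the raw Date column held values like "Jan 2000" (an assumption, since the source file isn't shown), the conversion behaves as follows:

# my() parses month-year strings into Date objects
my("Jan 2000")   # "2000-01-01"
my("01-2000")    # "2000-01-01"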

Let’s examine the data structure:

glimpse(unemployment_data)
Rows: 297
Columns: 7
$ Date             <date> 2000-01-01, 2000-02-01, 2000-03-01, 2000-04-01, 2000…
$ UnemploymentRate <dbl> 4.0, 4.1, 4.0, 3.8, 4.0, 4.0, 4.0, 4.1, 3.9, 3.9, 3.9…
$ us_m_lr          <dbl> 3.3, 3.5, 3.2, 3.1, 3.3, 3.2, 3.3, 3.3, 3.3, 3.3, 3.4…
$ us_w_lr          <dbl> 3.7, 3.6, 3.7, 3.5, 3.7, 3.7, 3.7, 3.8, 3.5, 3.3, 3.4…
$ lr_t25           <dbl> 3.0, 3.0, 3.0, 2.9, 3.0, 3.0, 3.0, 3.1, 3.0, 2.9, 3.0…
$ Year             <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,…
$ Month            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5,…

Creating a Time Series Object

Now we’ll convert our data into a proper time series object:

# Convert to time series
unemployment_ts <- ts(
  unemployment_data$UnemploymentRate,
  start = c(min(unemployment_data$Year), 
            min(unemployment_data$Month)),
  frequency = 12
)
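
# Quick sanity check on the ts object: with 297 monthly observations
# starting January 2000, the series should run through September 2024
frequency(unemployment_ts)  # 12
start(unemployment_ts)      # 2000 1
end(unemployment_ts)        # 2024 9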

# Create initial visualization
p1 <- ggplot(unemployment_data, aes(x = Date, y = UnemploymentRate)) +
  geom_line(color = "#2c3e50", linewidth = 1) +
  labs(
    title = "U.S. Unemployment Rate Time Series",
    x = "Year",
    y = "Unemployment Rate (%)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

p1

[Figure: U.S. Unemployment Rate Time Series]
Interpreting the Time Series Plot

The plot above shows several key features (a season plot sketch follows this list):

  1. A long-run level driven by the business cycle rather than a steady trend
  2. Mild seasonal fluctuations within each year
  3. Sharp spikes during the 2008-2009 recession and the 2020 pandemic
  4. A return to historically low rates in the most recent years
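
To inspect the seasonal pattern more directly, a season plot overlays each year of data; ggseasonplot() comes with the forecast package:

# Overlay each year to compare within-year (seasonal) movements
ggseasonplot(unemployment_ts) +
  labs(title = "Seasonal Plot: U.S. Unemployment Rate",
       y = "Unemployment Rate (%)") +
  theme_minimal()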

Time Series Analysis

Seasonal Decomposition

Let’s decompose our series to understand its components [@box2015time]:

# Decompose the series
decomposition <- stl(unemployment_ts, s.window = "periodic")

# Plot decomposition
autoplot(decomposition) +
  labs(title = "Seasonal Decomposition of Unemployment Rate") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    strip.background = element_blank()
  )

[Figure: Seasonal Decomposition of Unemployment Rate]
Key Components of Decomposition

The decomposition reveals (see the extraction sketch after this list):

  • Trend: Long-term movement in the series
  • Seasonal: Regular patterns that repeat with fixed period
  • Remainder: Random fluctuations after removing trend and seasonal components
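
If you want the numbers behind the plot, the fitted components live in the stl object's time.series slot. As a sketch, we can also compute the seasonal-strength measure of Hyndman and Athanasopoulos:

# Extract the fitted components (columns: seasonal, trend, remainder)
components <- decomposition$time.series
head(components)

# Seasonal strength: 1 - Var(remainder) / Var(seasonal + remainder);
# values near 0 indicate weak seasonality, values near 1 strong seasonality
seasonal_strength <- max(0, 1 - var(components[, "remainder"]) /
                              var(components[, "seasonal"] + components[, "remainder"]))
seasonal_strength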

Stationarity Testing

We’ll create a comprehensive function for stationarity testing:

# Function to conduct and interpret stationarity tests
check_stationarity <- function(ts_data, alpha = 0.05) {
  # ADF test: null hypothesis is a unit root (non-stationarity),
  # so a small p-value indicates stationarity
  adf_result <- adf.test(ts_data)
  
  # KPSS test: null hypothesis is stationarity,
  # so a small p-value indicates non-stationarity
  kpss_result <- kpss.test(ts_data)
  
  # Return results in a formatted table
  results_df <- data.frame(
    Test = c("ADF Test", "KPSS Test"),
    Statistic = c(adf_result$statistic, kpss_result$statistic),
    P_Value = c(adf_result$p.value, kpss_result$p.value),
    Interpretation = c(
      ifelse(adf_result$p.value < alpha, "Stationary", "Non-stationary"),
      ifelse(kpss_result$p.value > alpha, "Stationary", "Non-stationary")
    )
  )
  
  return(results_df)
}

# Check original series
stationarity_results <- check_stationarity(unemployment_ts)

# Display results in a formatted table
kable(stationarity_results, 
      caption = "Stationarity Test Results") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Stationarity Test Results

                Test        Statistic   P_Value    Interpretation
Dickey-Fuller   ADF Test   -2.3825898   0.4150041  Non-stationary
KPSS Level      KPSS Test   0.6985689   0.0136756  Non-stationary
Interpreting Stationarity Tests

Both tests agree here: the ADF test fails to reject its unit-root null (p = 0.415 > 0.05), and the KPSS test rejects its stationarity null (p = 0.014 < 0.05), so the series is non-stationary and we need to apply differencing. Let’s check the differenced series:

# Apply differencing
diff_unemployment_ts <- diff(unemployment_ts)

# Check differenced series
diff_results <- check_stationarity(diff_unemployment_ts)

# Display results
kable(diff_results, 
      caption = "Stationarity Test Results (Differenced Series)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Stationarity Test Results (Differenced Series)

                Test        Statistic   P_Value  Interpretation
Dickey-Fuller   ADF Test   -7.6542460   0.01     Stationary
KPSS Level      KPSS Test   0.0658335   0.10     Stationary

(The reported p-values of 0.01 and 0.10 are the bounds of the lookup tables used by tseries: the true ADF p-value is smaller than 0.01 and the true KPSS p-value is larger than 0.10.)
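
As a cross-check, the forecast package can estimate the required amount of differencing directly. Given the tests above we would expect one regular difference, but treat this as a sketch:

# Estimate how many differences are needed for stationarity
ndiffs(unemployment_ts)   # regular differences (KPSS-based by default)
nsdiffs(unemployment_ts)  # seasonal differences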

Model Building

Data Splitting

We’ll split our data into training and validation sets:

# Split data into training and validation sets
train_end <- length(unemployment_ts) - 12
train_ts <- window(unemployment_ts, end = c(time(unemployment_ts)[train_end]))
test_ts <- window(unemployment_ts, start = c(time(unemployment_ts)[train_end + 1]))

# Display information about the split
# Note: start() and end() return c(year, period), so we collapse each
# into a single "year-month" string; passing them to format() directly
# would silently recycle the two elements into extra data.frame rows
split_info <- data.frame(
  Dataset = c("Training", "Testing"),
  Start_Date = c(paste(start(train_ts), collapse = "-"),
                 paste(start(test_ts), collapse = "-")),
  End_Date = c(paste(end(train_ts), collapse = "-"),
               paste(end(test_ts), collapse = "-")),
  N_Observations = c(length(train_ts), length(test_ts))
)

kable(split_info, 
      caption = "Dataset Split Information") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Dataset Split Information

Dataset   Start_Date  End_Date  N_Observations
Training  2000-1      2023-9    285
Testing   2023-10     2024-9    12
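
For reference, forecast's subset() method for ts objects expresses the same split a little more directly; a sketch assuming the same 12-month holdout (the _alt names are just for illustration):

# Equivalent split using index-based subsetting from the forecast package
train_ts_alt <- subset(unemployment_ts, end = length(unemployment_ts) - 12)
test_ts_alt  <- subset(unemployment_ts, start = length(unemployment_ts) - 11)
length(test_ts_alt)  # 12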

Model Fitting and Selection

We’ll fit multiple models and compare their performance:

# Function to fit and evaluate multiple models
fit_models <- function(train_data) {
  # Fit different models
  models <- list(
    auto_arima = auto.arima(train_data, seasonal = TRUE),
    ets = ets(train_data),
    tbats = tbats(train_data)
  )
  
  # Generate forecasts
  forecasts <- lapply(models, forecast, h = 12)
  
  # Calculate accuracy metrics
  accuracy_metrics <- lapply(forecasts, accuracy)
  
  # Combine results
  results <- list(
    models = models,
    forecasts = forecasts,
    accuracy = accuracy_metrics
  )
  
  return(results)
}

# Fit models
model_results <- fit_models(train_ts)

# Display model summaries
for (model_name in names(model_results$models)) {
  cat("\n\n### ", toupper(model_name), " Model Summary\n")
  print(summary(model_results$models[[model_name]]))
}


###  AUTO_ARIMA  Model Summary
Series: train_data 
ARIMA(0,1,0) 

sigma^2 = 0.4546:  log likelihood = -291.05
AIC=584.1   AICc=584.11   BIC=587.75

Training set error measures:
                        ME      RMSE       MAE       MPE     MAPE      MASE
Training set -0.0006877193 0.6730919 0.1859789 -0.269922 2.802555 0.1674547
                   ACF1
Training set 0.02106772


###  ETS  Model Summary
ETS(M,A,N) 

Call:
 ets(y = train_data) 

  Smoothing parameters:
    alpha = 0.9996 
    beta  = 0.9996 

  Initial states:
    l = 3.8204 
    b = 0.1452 

  sigma:  0.1205

     AIC     AICc      BIC 
1380.192 1380.407 1398.455 

Training set error measures:
                       ME      RMSE       MAE       MPE     MAPE      MASE
Training set -0.000508763 0.9415695 0.2547617 0.0447237 3.857362 0.2293863
                   ACF1
Training set -0.4278826


###  TBATS  Model Summary
                  Length Class  Mode     
lambda              1    -none- numeric  
alpha               1    -none- numeric  
beta                0    -none- NULL     
damping.parameter   0    -none- NULL     
gamma.values        0    -none- NULL     
ar.coefficients     0    -none- NULL     
ma.coefficients     0    -none- NULL     
likelihood          1    -none- numeric  
optim.return.code   1    -none- numeric  
variance            1    -none- numeric  
AIC                 1    -none- numeric  
parameters          2    -none- list     
seed.states         1    -none- numeric  
fitted.values     285    ts     numeric  
errors            285    ts     numeric  
x                 285    -none- numeric  
seasonal.periods    0    -none- NULL     
y                 285    ts     numeric  
call                2    -none- call     
series              1    -none- character
method              1    -none- character
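
The TBATS listing above is just R's default summary of the model's internal list; tbats objects have no bespoke summary() method, so print() gives the more readable description of the fitted specification:

# print() shows the fitted TBATS specification directly
print(model_results$models$tbats)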

Model Comparison

Let’s compare the in-sample (training) performance of our models; an out-of-sample comparison against the holdout follows the table:

# Function to compare model performance
compare_models <- function(model_results) {
  # Get metrics for each model
  metrics <- sapply(model_results$forecasts, function(f) {
    acc <- accuracy(f)
    acc[1, ]  # single row: training-set accuracy metrics
  })
  
  # Convert to data frame
  comparison <- as.data.frame(t(metrics))
  rownames(comparison) <- c("ARIMA", "ETS", "TBATS")
  
  return(comparison)
}

# Generate comparison
model_comparison <- compare_models(model_results)

# Select best model based on training-set RMSE
best_model <- model_results$models[[
  which.min(sapply(model_results$forecasts, 
                   function(f) accuracy(f)[1, "RMSE"]))
]]

# Display formatted comparison table
kable(model_comparison, 
      caption = "Model Performance Comparison",
      digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  add_header_above(c(" " = 1, "Accuracy Metrics" = ncol(model_comparison)))
Model Performance Comparison

                      Accuracy Metrics
           ME   RMSE    MAE     MPE   MAPE   MASE    ACF1
ARIMA  -0.001  0.673  0.186  -0.270  2.803  0.167   0.021
ETS    -0.001  0.942  0.255   0.045  3.857  0.229  -0.428
TBATS  -0.006  0.700  0.195  -0.290  2.924  0.175  -0.183
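
The table above scores each model on its own training data. Since we held out 12 months in test_ts, we can also score the forecasts out of sample, which is the fairer basis for model selection; a minimal sketch:

# Evaluate each 12-month forecast against the held-out test set
test_accuracy <- t(sapply(model_results$forecasts, function(f) {
  accuracy(f, test_ts)["Test set", c("RMSE", "MAE", "MAPE")]
}))

kable(test_accuracy,
      caption = "Out-of-sample Accuracy (12-Month Holdout)",
      digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))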

Final Forecast and Visualization

Let’s generate and visualize forecasts using our best-performing model:

# Generate final forecast
final_forecast <- forecast(best_model, h = 12)

# Create visualization
autoplot(final_forecast) +
  labs(
    title = "12-Month Unemployment Rate Forecast",
    subtitle = paste("Using", class(best_model)[1], "Model"),
    x = "Year",
    y = "Unemployment Rate (%)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    legend.position = "bottom"
  )

[Figure: 12-Month Unemployment Rate Forecast]

Forecast Intervals and Summary

Let’s examine the detailed forecast results:

# Create forecast summary table
# Note: time() returns decimal years (e.g., 2023.75 for October 2023),
# so we convert via lubridate's date_decimal(); passing the raw number
# to as.Date() would misread it as days since 1970
forecast_table <- data.frame(
  Month = format(seq(as.Date(date_decimal(as.numeric(time(final_forecast$mean)[1]))),
                     by = "month", length.out = 12), "%Y-%m"),
  Point_Forecast = final_forecast$mean,
  Lower_80 = final_forecast$lower[, 1],
  Upper_80 = final_forecast$upper[, 1],
  Lower_95 = final_forecast$lower[, 2],
  Upper_95 = final_forecast$upper[, 2]
)

# Display formatted table
kable(forecast_table, 
      caption = "Detailed Forecast Results",
      digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  add_header_above(c(" " = 2, "80% Interval" = 2, "95% Interval" = 2))
Detailed Forecast Results

                          80% Interval          95% Interval
Month    Point_Forecast   Lower_80   Upper_80   Lower_95      Upper_95
2023-10  3.8              2.9358807  4.664119    2.47844359   5.121556
2023-11  3.8              2.5779508  5.022049    1.93103700   5.668963
2023-12  3.8              2.3033015  5.296699    1.51099715   6.089003
2024-01  3.8              2.0717614  5.528239    1.15688717   6.443113
2024-02  3.8              1.8677705  5.732229    0.84491002   6.755090
2024-03  3.8              1.6833487  5.916651    0.56286112   7.037139
2024-04  3.8              1.5137553  6.086245    0.30349039   7.296510
2024-05  3.8              1.3559016  6.244098    0.06207399   7.537926
2024-06  3.8              1.2076421  6.392358   -0.16466924   7.764669
2024-07  3.8              1.0674149  6.532585   -0.37912832   7.979128
2024-08  3.8              0.9340405  6.665959   -0.58310676   8.183107
2024-09  3.8              0.8066030  6.793397   -0.77800571   8.378006
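
One caveat: the later 95% lower bounds dip below zero, which is impossible for an unemployment rate. A common remedy is to model on the log scale and let forecast() back-transform; a minimal sketch, assuming we refit with a Box-Cox lambda of 0:

# Sketch: refit on the log scale (lambda = 0) so that back-transformed
# point forecasts and prediction intervals stay strictly positive
fit_log <- auto.arima(train_ts, lambda = 0, seasonal = TRUE)
final_forecast_log <- forecast(fit_log, h = 12)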

Model Diagnostics

Let’s perform comprehensive diagnostics on our chosen model:

# Perform residual diagnostics and capture the plot
residuals_plot <- ggplot2::autoplot(best_model$residuals) +
  theme_minimal() +
  labs(title = "Model Residual Diagnostics") +
  theme(plot.title = element_text(hjust = 0.5))

# Display the plot
print(residuals_plot)

[Figure: Model Residual Diagnostics]
# Perform Ljung-Box test separately (Box.test defaults to lag = 1;
# a stricter check would use a larger lag, e.g. lag = 24, with fitdf
# set to the number of estimated model parameters)
lb_test <- Box.test(residuals(best_model), type = "Ljung-Box")

# Create diagnostic summary table
diagnostic_summary <- data.frame(
  Metric = c("Ljung-Box Test Statistic", 
             "P-value", 
             "Residual Standard Error"),
  Value = c(lb_test$statistic,
            lb_test$p.value,
            sd(residuals(best_model)))
)

# Display diagnostic summary
kable(diagnostic_summary, 
      caption = "Model Diagnostic Summary",
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Diagnostic Summary

Metric                     Value
Ljung-Box Test Statistic  0.1278
P-value                   0.7207
Residual Standard Error   0.6743

Interpreting Diagnostics

The diagnostic plots and tests help us assess model adequacy (a one-call shortcut follows this list):

  1. Residual Plot: Checks for patterns in residuals
  2. ACF Plot: Examines autocorrelation in residuals
  3. Histogram: Assesses normality of residuals
  4. Ljung-Box Test: Formal test for residual autocorrelation
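
As the one-call shortcut mentioned above, forecast's checkresiduals() bundles these diagnostics together: it draws the residual time plot, ACF, and histogram, and runs a Ljung-Box test with a sensible default lag:

# All of the diagnostics above in one call
checkresiduals(best_model)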

Conclusions

This analysis demonstrates a comprehensive approach to time series forecasting using R. Key findings include:

  1. The unemployment rate shows clear seasonal patterns and trends
  2. The ARIMA(0,1,0) model (a random walk without drift) provided the best forecasting performance on the training data
  3. Diagnostic tests suggest the model adequately captures the data patterns
  4. Forecast intervals provide a measure of prediction uncertainty

Next Steps

Consider these potential improvements:

  • Incorporate additional predictor variables
  • Experiment with alternative modeling approaches
  • Implement rolling-window validation
  • Explore ensemble methods