Grouping and summarizing data

In this section, we will explore how to group and summarize data by manipulating large data sets directly in R, rather than using Excel. This approach streamlines your workflow, ensuring that data processing is reproducible and less prone to manual errors.

Specifically, we will demonstrate how to convert daily COVID-19 death data into weekly deaths per million for the LA5 countries (Brazil, Mexico, Chile, Colombia, and Peru) as well as for the broader Latin America and Caribbean (LAC) region.

This process would be highly tedious in Excel, as the dataset contains 429,435 rows and manual operations such as filtering, aggregating by week, and calculating rates would be repetitive and error-prone. In contrast, with R we can complete the entire process in a few clear steps that are scalable, reproducible, and easy to maintain.

Why Use R for Large Datasets?

  • Efficiency: R is capable of processing and aggregating large datasets quickly. Working with hundreds of thousands of rows, such as our COVID-19 dataset, is cumbersome in Excel but becomes streamlined in R.

  • Reproducibility: Once the R script is written, it can be re-run at any time on updated data without manual rework. This is particularly helpful when dealing with ongoing, updated datasets.

  • Scalability: Whether you are working with a few hundred rows or millions of rows, R can handle data with equal ease, and the same code can be reused for future analysis.

Let’s go through the process of working with this dataset step by step.

Loading Data

Loading the Libraries

To begin, we’ll need to load the data from an Excel file using the readxl package. As we have previously seen, this package allows us to read data directly into R.

Let’s load all the necessary libraries we will use for this section.

# Load Libraries 
library(readxl)
library(here)
here() starts at C:/IMF-R-Book
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)  # For handling dates and aggregating by week

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union

Loading IMF format

Let’s also load the IMF format so that we can use it for our visualizations later.

source(here("utils/theme_and_colors_IMF.R")) 
source(here("utils/Add_text_to_figure_panel.R"))

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine

Loading the Data

Let’s load a large COVID-19 dataset downloaded from Our World in Data, which contains daily records for all countries around the world.

covid_data <- read_excel(here("databases/owid-covid-data.xlsx"), sheet = "Sheet1") 
dim(covid_data)
[1] 429435     67

The dim() function lets you check the number of rows and columns. As noted earlier, this is a large dataset with over 429,000 rows and 67 columns. In Excel, filtering and aggregating this amount of data would be tedious and slow; in R, we can quickly filter, aggregate, and calculate new metrics with just a few lines of code.
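Before filtering, it can help to glance at the handful of columns this section actually uses. The optional preview below relies only on columns that appear later in the code (location, date, new_deaths, population).

# Optional: preview the columns used in this section
covid_data %>%
  select(location, date, new_deaths, population) %>%
  glimpse()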

Filtering Data

Filtering Data for LA5 Countries and LAC

The first task is to filter the dataset so that it includes only the LA5 countries and the countries of the broader Latin America and Caribbean (LAC) region.

We can do this easily by specifying country names and continents.

# Filter for the LA5 countries
la5_countries <- c("Brazil", "Mexico", "Chile", "Colombia", "Peru")
la5_data <- covid_data %>%
  filter(location %in% la5_countries)

# Filter data for the broader Latin America and Caribbean (LAC) region 
lac_data <- covid_data %>%
  filter(continent == "South America" | location == "Mexico")  # Approximate LAC as South American countries plus Mexico

This step narrows down the dataset to focus only on these specific countries, making the data easier to work with for our analysis.
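As a quick sanity check, we can count how many daily rows remain for each LA5 country:

# Optional check: number of daily rows per LA5 country
la5_data %>%
  count(location)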

Aggregation of Data

Aggregation to Weekly Data

COVID-19 data is recorded daily, but a weekly summary often shows trends more clearly by smoothing out daily fluctuations. In this step, we will aggregate the daily death counts into weekly totals.

Let’s first do this for LA5 countries.

# Convert the date column to proper date format
la5_data$date <- as.Date(la5_data$date)

# Focus on 2020 and 2021 data only.
la5_data <- la5_data %>% filter(date < as.Date("2022-01-01"))
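A short optional check confirms that only dates before 2022 remain:

# Optional check: earliest and latest dates kept after filtering
range(la5_data$date)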




# Group data by country and week, and sum the deaths for each week
la5_weekly_data <- la5_data %>%
  mutate(week = floor_date(date, "week")) %>%  # Create a new column for the week
  group_by(location, week) %>%  # Group by country and week
  summarise(weekly_deaths = sum(new_deaths, na.rm = TRUE),  # Sum deaths for each week
            population = max(population, na.rm = TRUE))  # Retain population for each country
`summarise()` has grouped output by 'location'. You can override using the
`.groups` argument.

Now let’s do it for the LAC region.

# Convert the date column for LAC data
lac_data$date <- as.Date(lac_data$date)


# Focus on 2020 and 2021 data only.
lac_data <- lac_data %>% filter(date < as.Date("2022-01-01"))



# Group LAC data by week and calculate total deaths across all LAC countries
lac_weekly_data <- lac_data %>%
  mutate(week = floor_date(date, "week")) %>%  # Create week column
  group_by(week) %>%
  summarise(weekly_deaths = sum(new_deaths, na.rm = TRUE),  # Sum deaths across all LAC countries
            population = sum(population[!duplicated(location)], na.rm = TRUE))  # Total LAC population (one value per country)

This step is much more efficient than in Excel, where we would have to create pivot tables for each country and adjust the data manually. Here, R groups and aggregates the data for us automatically.
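If you would like to inspect the result before moving on, a brief preview of the first weekly rows for each country could look like this:

# Optional: preview the first two weekly observations per country
la5_weekly_data %>%
  group_by(location) %>%
  slice_head(n = 2)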

Calculating Deaths per Million

Once we have weekly death totals, we can calculate deaths per million people for each country, which allows for more meaningful comparisons across countries with different population sizes.

For LA5 countries:

# Calculate deaths per million
la5_weekly_data <- la5_weekly_data %>%
  mutate(deaths_per_million = (weekly_deaths / population) * 1e6)

For LAC region:

# Calculate deaths per million for LAC region
lac_weekly_data <- lac_weekly_data %>%
  mutate(deaths_per_million = (weekly_deaths / population) * 1e6)

In Excel, calculating deaths per million would require manually dividing the weekly totals by the population and then scaling. With R, this calculation is done efficiently using the mutate() function.
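With the rate in hand, we can already answer simple questions, for example which week was the deadliest in each country. Here is a small optional sketch using dplyr’s slice_max():

# Optional: the week with the highest deaths per million in each LA5 country
la5_weekly_data %>%
  group_by(location) %>%
  slice_max(deaths_per_million, n = 1)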

Data Visualization

Combining Datasets

To visualize the LA5 countries alongside the LAC region, we need to combine the datasets into one. We do this by adding a location column to the LAC data so that its columns match the LA5 data, and then binding the rows together.

# Add a 'location' column to LAC data to match the LA5 data
lac_weekly_data <- lac_weekly_data %>%
  mutate(location = "LAC")

# Combine LA5 and LAC data
combined_weekly_data <- bind_rows(la5_weekly_data, lac_weekly_data)
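To verify the combination worked, we can list the series now present in the combined data; we expect six of them, the five LA5 countries plus the LAC aggregate:

# Optional check: the series included in the combined data
combined_weekly_data %>%
  distinct(location)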

Plotting the Data

We can now plot the weekly deaths per million for both the LA5 countries and the LAC region, allowing us to compare trends between individual countries and the region as a whole.

# Plot the weekly deaths per million for LA5 countries and LAC region
ggplot(combined_weekly_data, aes(x = week, y = deaths_per_million, color = location)) +
  geom_line(linewidth = 1) +
  labs(title = "Weekly COVID-19 Deaths per Million in LA5 Countries and LAC Region",
       x = "Week",
       y = "Deaths per Million",
       color = "Region/Country") +
  theme_minimal()

Finalizing the plot

As a last step, let’s apply the IMF theme and color palette to polish the figure.

# IMF color palette
imf_colors <- c("Brazil" = "#4B82AD", 
                "Mexico" = "#96BA79", 
                "Chile" = "#CAE0FB", 
                "Colombia" = "#D3D3D3", 
                "Peru" = "#EE2400", 
                "LAC" = "#DAA520")



# Plot the weekly deaths per million for LA5 countries and LAC region
ggplot(combined_weekly_data, aes(x = week, y = deaths_per_million, color = location)) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = imf_colors) +
  scale_x_date(date_labels = "%m/%y", date_breaks = "3 months") +  # Format dates as MM/YY
  labs(title = "Weekly COVID-19 Deaths per Million in LA5 Countries\nand LAC Region (2020-2021)",
       subtitle = "(Using new deaths per week)",
       x = "Date",
       y = "Deaths per Million") +  # The legend title is removed in theme() below
  theme_imf() +
  theme(
    plot.title = element_text(hjust = 0, size = 16, face = "bold"),
    plot.subtitle = element_text(hjust = 0, size = 12),
    axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    legend.position = "bottom",
    legend.title = element_blank()  # Remove "Country/Region" from legend
  )

This looks much better!
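If you want to export the finished figure, for example to drop it into a report, ggplot2’s ggsave() saves the most recently displayed plot. The file name, folder, and dimensions below are only illustrative choices:

# Save the last plot to disk (file name, folder, and size are illustrative)
ggsave(here("figures/weekly_deaths_per_million_LA5_LAC.png"),
       width = 8, height = 5, dpi = 300)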

In this section, we have demonstrated that R is a powerful tool for data manipulation, particularly when working with large datasets like the COVID-19 statistics, and that it can greatly simplify tasks that would be tedious and error-prone in Excel.