In this section, we will explore how to group and summarize data by manipulating large data sets directly in R, rather than using Excel. This approach streamlines your workflow, ensuring that data processing is reproducible and less prone to manual errors.
Specifically, we will demonstrate how to convert daily COVID-19 death data into weekly deaths per million for the LA5 countries (Brazil, Mexico, Chile, Argentina, and Peru) as well as for the broader Latin America and Caribbean (LAC) region.
This process would be highly tedious in Excel, as the dataset contains 429,436 rows of data and manual operations such as filtering, aggregating by week, and calculating rates would require repetitive and error-prone tasks. In contrast, using R, we can complete this entire process in a few clear steps that are scalable, reproducible, and easy to maintain.
Why Use R for Large Datasets?
Efficiency: R is capable of processing and aggregating large datasets quickly. Working with hundreds of thousands of rows, such as our COVID-19 dataset, is cumbersome in Excel but becomes streamlined in R.
Reproducibility: Once the R script is written, it can be re-run at any time on updated data without manual rework. This is particularly helpful when dealing with ongoing, updated datasets.
Scalability: Whether you are working with a few hundred rows or millions of rows, R can handle data with equal ease, and the same code can be reused for future analysis.
Let’s go through the process of working with this dataset step by step.
Loading Data
Loading the Libraries
To begin, we’ll need to load the data from an Excel file using the readxl package. As we have previously seen, this package allows us to read data directly into R.
Let’s load all the necessary libraries we will use for this section.
# Load Libraries library(readxl)library(here)
here() starts at C:/IMF-R-Book
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(ggplot2)library(lubridate) # For handling dates and aggregating by week
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
Loading IMF format
Let’s also load the IMF format so that we can use it for our visualizations later.
The ‘dim’ function allows you to check the number of rows and columns. As we noted earlier, this is a large data set with over 429,000 rows and 67 columns. In Excel, filtering and aggregating this amount of data would be tedious and slow. However, in R, we can quickly filter, aggregate, and calculate new metrics with just a few lines of code.
Filtering Data
Filtering Data for LA5 Countries and LAC
The first task is to filter the dataset to focus only on the LA5 countries and Latin America and Caribbean countries.
We can easily do this by specifying country names, and continents.
# Filter for the LA5 countriesla5_countries <-c("Brazil", "Mexico", "Chile", "Colombia", "Peru")la5_data <- covid_data %>%filter(location %in% la5_countries)# Filter data for the broader Latin America and Caribbean (LAC) region lac_data <- covid_data %>%filter(continent =="South America"| location =="Mexico")
This step narrows down the dataset to focus only on these specific countries, making the data easier to work with for our analysis.
Aggregation of Data
Aggregation to Weekly Data
COVID-19 data is recorded daily, but often a weekly summary provides clearer trends by smoothing out daily fluctuations. In this step, we will aggregate the daily death data into weekly totals.
Let’s first do this for LA5 countries.
# Convert the date column to proper date formatla5_data$date <-as.Date(la5_data$date)# Focus on 2020 and 2021 data only.la5_data <- la5_data %>%filter(date <as.Date("2022-01-01"))# Group data by country and week, and sum the deaths for each weekla5_weekly_data <- la5_data %>%mutate(week =floor_date(date, "week")) %>%# Create a new column for the weekgroup_by(location, week) %>%# Group by country and weeksummarise(weekly_deaths =sum(new_deaths, na.rm =TRUE), # Sum deaths for each weekpopulation =max(population, na.rm =TRUE)) # Retain population for each country
`summarise()` has grouped output by 'location'. You can override using the
`.groups` argument.
Now let’s do it for the LAC region.
# Convert the date column for LAC datalac_data$date <-as.Date(lac_data$date)# Focus on 2020 and 2021 data only.lac_data <- lac_data %>%filter(date <as.Date("2022-01-01"))# Group LAC data by week and calculate total deaths across all LAC countrieslac_weekly_data <- lac_data %>%mutate(week =floor_date(date, "week")) %>%# Create week columngroup_by(week) %>%summarise(weekly_deaths =sum(new_deaths, na.rm =TRUE), # Sum deaths across all LAC countriespopulation =max(population, na.rm =TRUE)) # Total population of LAC region
This step is much more efficient than Excel, where we would have to manually create pivot tables for each country and adjust the data. Here, R automatically groups and aggregates the data for us.
Calculating Deaths per Million
Once we have weekly death totals, we can calculate deaths per million people for each country, which allows for more meaningful comparisons across countries with different population sizes.
# Calculate deaths per million for LAC regionlac_weekly_data <- lac_weekly_data %>%mutate(deaths_per_million = (weekly_deaths / population) *1e6)
In Excel, calculating deaths per million would require manually dividing the weekly totals by the population and then scaling. With R, this calculation is done efficiently using the mutate() function.
Data Visualization
Combining Datasets
To visualize LA5 countries alongside the LAC region, we need to combine the datasets into one. We do this by adding a location column to allow R to match the datasets, and then combine them.
# Add a 'location' column to LAC data to match the LA5 datalac_weekly_data <- lac_weekly_data %>%mutate(location ="LAC")# Combine LA5 and LAC datacombined_weekly_data <-bind_rows(la5_weekly_data, lac_weekly_data)
Plotting the Data
We can now plot the weekly deaths per million for both the LA5 countries and the LAC region, allowing us to compare trends between individual countries and the region as a whole.
# Plot the weekly deaths per million for LA5 countries and LAC regionggplot(combined_weekly_data, aes(x = week, y = deaths_per_million, color = location)) +geom_line(linewidth =1) +labs(title ="Weekly COVID-19 Deaths per Million in LA5 Countries and LAC Region",x ="Week",y ="Deaths per Million",color ="Region/Country") +theme_minimal()
Finalizing the plot
As a last step, let’s use the IMF theme to beautify it.
# IMF color paletteimf_colors <-c("Brazil"="#4B82AD", "Mexico"="#96BA79", "Chile"="#CAE0FB", "Colombia"="#D3D3D3", "Peru"="#EE2400", "LAC"="#DAA520")# Plot the weekly deaths per million for LA5 countries and LAC regionggplot(combined_weekly_data, aes(x = week, y = deaths_per_million, color = location)) +geom_line(linewidth =1) +scale_color_manual(values = imf_colors) +scale_x_date( date_labels ="%m/%y", date_breaks ="3 months") +# Format as MM/YYlabs(title ="Weekly COVID-19 Deaths per Million in LA5 Countries\nand LAC Region (2020 - July 2022)",subtitle ="(Using new deaths per week)",x ="Date",y ="Deaths per Million") +# Remove legend titletheme_imf() +theme(plot.title =element_text(hjust =0, size =16, face ="bold"),plot.subtitle =element_text(hjust =0, size =12),axis.text.x =element_text(angle =0, hjust =0.5, size =10),axis.title.x =element_text(size =12),axis.title.y =element_text(size =12),legend.position ="bottom",legend.title =element_blank() # Remove "Country/Region" from legend )
This looks much better!
In this section we have demonstrated how R is a powerful tool for data manipulation, particularly when working with large datasets like the COVID-19 statistics, and can simply the process of working with large volumes of data.