Self-employment and GDP per worker
Introduction
In this tutorial, we analyze the relationship between self-employment and labor productivity (measured as GDP per worker) using data from the World Bank. We download and process the data, express productivity relative to the USA, and plot it against the self-employment share.
Load Necessary Libraries
First, we load the necessary libraries: wbstats for accessing World Bank data, dplyr and tidyr for data manipulation, ggplot2 and ggrepel for data visualization, and here for building file paths relative to the project root. The final line sources a project script which, judging by its name, supplies the plot theme and colors used later.
library(wbstats)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggrepel)
library(here)
source(here("utils/theme_and_colors_IMF.R"))
Load and Process Data
Self-Employment Data
We start by loading the self-employment data (indicator SL.EMP.SELF.ZS) using the wb_data function from the wbstats package. We keep the years 1990 to 2022, rename the columns for clarity, and restrict the data to five World Bank aggregates (EAS, LCN, SSF, HIC, CEB) plus the USA.
self_employment <- wb_data(indicator = "SL.EMP.SELF.ZS", country = "all")
self_employment_data <- self_employment %>%
  filter(date %in% 1990:2022) %>%
  rename(year = date, self_employment = SL.EMP.SELF.ZS) %>%
  filter(iso3c %in% c("EAS", "LCN", "SSF", "HIC", "CEB", "USA")) %>%
  select(iso3c, year, self_employment)
self_employment_data$year <- as.numeric(self_employment_data$year)
head(self_employment_data)
# A tibble: 6 × 3
  iso3c  year self_employment
  <chr> <dbl>           <dbl>
1 CEB    2022            17.6
2 CEB    2021            17.8
3 CEB    2020            18.0
4 CEB    2019            17.8
5 CEB    2018            18.1
6 CEB    2017            18.5
GDP Per Worker Data
Next, we load the GDP per worker (labor productivity) data (indicator SL.GDP.PCAP.EM.KD), apply the same year and country filters, and rename the columns.
gdp_per_worker <- wb_data(indicator = "SL.GDP.PCAP.EM.KD", country = "all")
gdp_data <- gdp_per_worker %>%
  filter(date %in% 1990:2022) %>%
  rename(gdp_per_worker = SL.GDP.PCAP.EM.KD, year = date) %>%
  filter(iso3c %in% c("EAS", "LCN", "SSF", "HIC", "CEB", "USA")) %>%
  select(iso3c, year, gdp_per_worker)
gdp_data$year <- as.numeric(gdp_data$year)
head(gdp_data)
# A tibble: 6 × 3
  iso3c  year gdp_per_worker
  <chr> <dbl>          <dbl>
1 CEB    1990            NA
2 CEB    1991         38326.
3 CEB    1992         36717.
4 CEB    1993         37507.
5 CEB    1994         38874.
6 CEB    1995         41023.
Merge and Process Data
We merge the self-employment and GDP per worker data into a single data frame and adjust the GDP values to be in thousands.
merged_data <- merge(gdp_data, self_employment_data, by = c("iso3c", "year"))
merged_data$gdp_per_worker <- merged_data$gdp_per_worker / 1000
selected_data <- merged_data %>% select(iso3c, year, gdp_per_worker, self_employment)
head(selected_data)
  iso3c year gdp_per_worker self_employment
1   CEB 1990             NA              NA
2   CEB 1991       38.32582        23.23896
3   CEB 1992       36.71664        23.70352
4   CEB 1993       37.50670        24.21145
5   CEB 1994       38.87385        24.30136
6   CEB 1995       41.02278        24.73645
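One detail of the merge step is worth seeing in isolation: base R's merge() keeps only rows whose keys appear in both inputs unless all = TRUE is set. The sketch below uses two small made-up data frames (the codes AAA and BBB and all values are hypothetical, not World Bank data) to contrast the default with a full outer merge.

```r
# Toy inputs with hypothetical codes and values, for illustration only
toy_gdp <- data.frame(iso3c = c("AAA", "AAA", "BBB"),
                      year = c(2000, 2001, 2000),
                      gdp_per_worker = c(40000, 42000, 9000))
toy_self <- data.frame(iso3c = c("AAA", "BBB", "BBB"),
                       year = c(2000, 2000, 2001),
                       self_employment = c(20, 70, 69))

# Default merge: only (iso3c, year) pairs present in both inputs survive
toy_inner <- merge(toy_gdp, toy_self, by = c("iso3c", "year"))

# all = TRUE keeps every pair and fills the missing side with NA
toy_outer <- merge(toy_gdp, toy_self, by = c("iso3c", "year"), all = TRUE)

print(toy_inner)
print(toy_outer)
```

In the tutorial both inputs are filtered to the same years and codes, so the default is a reasonable choice; all = TRUE would be one way to check whether any region-year pairs are being dropped silently.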
Adjust GDP Values Relative to the USA
We pivot the GDP data to a wide format, adjust the values relative to the USA, and clean up the column names.
adjust_values <- function(df) {
  df %>%
    rowwise() %>%
    mutate(across(.cols = -c(year, USA), .fns = ~ . / USA,
                  .names = "relative_{.col}")) %>%
    select(year, starts_with("relative_"))
}
The function has three steps:
- rowwise(): Groups the data frame by row so that the following operations are evaluated one row at a time. In the wide table each row is one year, so every value is divided by the USA value for that same year. Because division is already vectorized, the result would be the same without rowwise(); it mainly makes the row-by-row intent explicit.
- mutate(across(.cols = -c(year, USA), .fns = ~ . / USA, .names = "relative_{.col}")): This line performs the main adjustment.
  - mutate(): Creates new columns or modifies existing ones.
  - across(): Specifies which columns the function is applied to.
  - .cols = -c(year, USA): Includes all columns except year and USA.
  - .fns = ~ . / USA: The function to apply, here dividing each value by the corresponding value in the USA column. The tilde (~) is shorthand for an anonymous function.
  - .names = "relative_{.col}": The naming pattern for the new columns. Each new name is the original column name prefixed with "relative_", indicating that the values are relative to the USA.
- select(year, starts_with("relative_")): Keeps only the year column and the columns whose names start with "relative_". This drops the original columns, leaving the year and the relative values.
Example Workflow
- Original Data Frame:
  - Contains one column of GDP per worker for each region, plus the USA, with one row per year.
- Step-by-Step Adjustment:
  - rowwise(): Ensures the function operates row by row.
  - mutate(across(.cols = -c(year, USA), .fns = ~ . / USA, .names = "relative_{.col}")):
    - For each row, divide the GDP per worker of each region by the GDP per worker of the USA for that same row.
    - Create new columns with names prefixed by "relative_" to indicate the adjustment.
  - select(year, starts_with("relative_")):
    - Keep only the year column and the newly created "relative_" columns, simplifying the data frame to the adjusted values.
This function is useful for comparing GDP per worker across countries relative to the USA, providing a standardized way to understand relative productivity levels.
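To make the adjustment concrete, here is a minimal sketch of the same mutate(across()) pattern on a tiny wide table. The region codes AAA and BBB and all numbers are made up for illustration. The sketch omits rowwise(), relying on the fact that division is vectorized, and it also shows that the earlier rescaling to thousands has no effect on the ratios, since the factor cancels.

```r
library(dplyr)

# Hypothetical wide table: one row per year, one column per region plus USA
toy_wide <- tibble(year = c(2000, 2001),
                   USA  = c(100, 110),
                   AAA  = c(50, 66),
                   BBB  = c(25, 11))

# Same pattern as adjust_values(), without rowwise()
relative <- toy_wide %>%
  mutate(across(.cols = -c(year, USA), .fns = ~ .x / USA,
                .names = "relative_{.col}")) %>%
  select(year, starts_with("relative_"))

# Rescaling every column by the same factor leaves the ratios unchanged
relative_scaled <- toy_wide %>%
  mutate(across(-year, ~ .x / 1000)) %>%
  mutate(across(.cols = -c(year, USA), .fns = ~ .x / USA,
                .names = "relative_{.col}")) %>%
  select(year, starts_with("relative_"))

print(relative)
print(all.equal(relative, relative_scaled))
```

Here relative_AAA is 50/100 and 66/110, and relative_BBB is 25/100 and 11/110, each computed against the USA value in the same row.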
Applying the Function
gdp_wide <- selected_data %>%
  select(iso3c, year, gdp_per_worker) %>%
  pivot_wider(names_from = iso3c, values_from = gdp_per_worker)
adjusted_gdp <- adjust_values(gdp_wide)
clean_gdp <- adjusted_gdp %>%
  rename_with(.fn = ~ gsub("relative_", "", .x), .cols = starts_with("relative_"))
head(clean_gdp)
# A tibble: 6 × 5
# Rowwise:
   year    CEB    EAS    LCN     SSF
  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1  1990 NA     NA     NA     NA
2  1991  0.416  0.128  0.371  0.0946
3  1992  0.388  0.130  0.362  0.0892
4  1993  0.392  0.134  0.363  0.0850
5  1994  0.401  0.139  0.369  0.0821
6  1995  0.421  0.145  0.364  0.0820
Pivot Data Back to Long Format
We pivot the adjusted GDP data back to a long format and merge it with the self-employment data.
gdp_long <- clean_gdp %>%
  pivot_longer(cols = -year, names_to = "iso3c")
merged_long <- merge(gdp_long, selected_data %>% select(iso3c, year, self_employment),
                     by = c("year", "iso3c"))
merged_long <- merged_long %>% rename(gdp_per_worker = value)
head(merged_long)
  year iso3c gdp_per_worker self_employment
1 1990   CEB             NA              NA
2 1990   EAS             NA              NA
3 1990   LCN             NA              NA
4 1990   SSF             NA              NA
5 1991   CEB      0.4155623        23.23896
6 1991   EAS      0.1282531        64.77200
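The rename step above is needed because pivot_longer() stores the values in a column called value by default. As a small sketch on made-up data (the codes AAA and BBB and the numbers are hypothetical), the values_to argument can name that column directly during the pivot:

```r
library(tidyr)

# Hypothetical wide table of productivity ratios, one column per region
toy_wide <- data.frame(year = c(2000, 2001),
                       AAA  = c(0.5, 0.6),
                       BBB  = c(0.25, 0.1))

# names_to receives the former column names, values_to the cell values
toy_long <- pivot_longer(toy_wide, cols = -year,
                         names_to = "iso3c",
                         values_to = "gdp_per_worker")

print(toy_long)
```

Each year now contributes one row per region, which is the shape needed to merge with the self-employment data on year and iso3c.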
Filter Data for Specific Regions and Years
We keep only the observations from 1997 onward for our analysis.
filtered_data <- merged_long %>%
  filter(year > 1996)
head(filtered_data)
  year iso3c gdp_per_worker self_employment
1 1997   CEB     0.42312325        25.06862
2 1997   EAS     0.15091328        60.71787
3 1997   LCN     0.36562360        39.96472
4 1997   SSF     0.08194211        80.27294
5 1998   CEB     0.42089751        25.09654
6 1998   EAS     0.14543160        60.42258
Define Region Names and Custom Colors
We define long names for the region codes and custom colors for the plot.
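iso3c_long_names <- c(CEB = "Central Europe and Baltics",
                      EAS = "East Asia and Pacific",
                      LCN = "Latin America and Caribbean",
                      SSF = "Sub-Saharan Africa",
                      HIC = "High income countries")
# blue, red and green are unquoted, so they must already exist as color objects,
# presumably defined in the sourced theme_and_colors_IMF.R
custom_colors <- c(CEB = "grey", EAS = blue, LCN = red, SSF = green, HIC = "black")
Create Plot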
We create a scatter plot to visualize the relationship between self-employment and GDP per worker, with labels for specific years.
label_data <- filtered_data %>%
  filter(year %in% c(1997, 2010, 2022))
plot <- ggplot(filtered_data, aes(x = self_employment, y = gdp_per_worker, col = iso3c)) +
  geom_point() +
  scale_y_log10() +
  theme_imf() +
  xlab("Share of self-employment in total employment") +
  geom_text_repel(data = label_data,
                  aes(x = self_employment, y = gdp_per_worker,
                      label = year, col = iso3c),
                  size = 3.5,
                  show.legend = FALSE,
                  box.padding = unit(0.15, "lines"),
                  point.padding = unit(0.5, "lines"),
                  segment.color = 'grey50') +
  ylab("Ratio of labor productivity to that of US") +
  ggtitle("Self-employment and labor productivity, 1997-2022") +
  theme_imf_panel() +
  scale_color_manual(values = custom_colors, labels = iso3c_long_names) +
  theme(legend.position = c(0.7, 0.8)) +
  theme(legend.title = element_blank())
print(plot)
Save Plot
Finally, we save the plot as a high-resolution PNG file.
ggsave(plot, filename = here("figures/fund-7--self-empl-lab-prod-reg-90-22.png"),
       dpi = 600, width = 8.5 * 1.2, height = 5.5 * 1.2)
Conclusion
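In this tutorial, we analyzed the relationship between self-employment and labor productivity. We downloaded and processed World Bank data, expressed GDP per worker relative to the USA, and plotted it against the self-employment share for each region. The processed data already hint at the pattern: in the 1997 rows shown above, Sub-Saharan Africa combines a self-employment share of about 80% with productivity of about 8% of the US level, while Central Europe and the Baltics combine a share of about 25% with productivity of about 42% of the US level.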