Self-employment and GDP per worker

Introduction

In this tutorial, we will analyze the relationship between self-employment and labor productivity (measured as GDP per worker) using data from the World Bank. We will process the data, create visualizations, and interpret the results.

Load Necessary Libraries

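First, we load the necessary libraries: wbstats for accessing World Bank data, dplyr and tidyr for data manipulation, ggplot2 and ggrepel for data visualization, and here for building file paths relative to the project root. We also source a project file that, judging by its name and the functions used later (theme_imf, theme_imf_panel), provides the plot theme and colors.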

library(wbstats)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggrepel)
library(here)
source(here("utils/theme_and_colors_IMF.R"))

Load and Process Data

Self-Employment Data

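We start by loading the self-employment data (indicator SL.EMP.SELF.ZS, the share of self-employment in total employment) using the wb_data function from the wbstats package. We keep the years 1990 to 2022, rename columns for clarity, and restrict the data to five aggregates (four regions and the high-income group) plus the USA.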

self_employment <- wb_data(indicator = "SL.EMP.SELF.ZS", country = "all")

self_employment_data <- self_employment %>%
  filter(date %in% 1990:2022) %>%
  rename(year = date, self_employment = SL.EMP.SELF.ZS) %>%
  filter(iso3c %in% c("EAS", "LCN", "SSF", "HIC", "CEB", "USA")) %>%
  select(iso3c, year, self_employment)

self_employment_data$year <- as.numeric(self_employment_data$year)

head(self_employment_data)

# A tibble: 6 × 3
  iso3c  year self_employment
  <chr> <dbl>           <dbl>
1 CEB    2022            17.6
2 CEB    2021            17.8
3 CEB    2020            18.0
4 CEB    2019            17.8
5 CEB    2018            18.1
6 CEB    2017            18.5

GDP Per Worker Data

Next, we load the GDP per worker (labor productivity) data, apply the same year and region filters, and rename the columns.

gdp_per_worker <- wb_data(indicator = "SL.GDP.PCAP.EM.KD", country = "all")

gdp_data <- gdp_per_worker %>%
  filter(date %in% 1990:2022) %>%
  rename(gdp_per_worker = SL.GDP.PCAP.EM.KD, year = date) %>%
  filter(iso3c %in% c("EAS", "LCN", "SSF", "HIC", "CEB", "USA")) %>%
  select(iso3c, year, gdp_per_worker)

gdp_data$year <- as.numeric(gdp_data$year)

head(gdp_data)

# A tibble: 6 × 3
  iso3c  year gdp_per_worker
  <chr> <dbl>          <dbl>
1 CEB    1990            NA 
2 CEB    1991         38326.
3 CEB    1992         36717.
4 CEB    1993         37507.
5 CEB    1994         38874.
6 CEB    1995         41023.
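
The two cleaning pipelines above differ only in the indicator code and the name of the output column, so they could be folded into one helper. The following is a minimal sketch under that assumption; prepare_indicator and the mock_raw data are hypothetical and not part of the original analysis, and the mock values are made up.

```r
library(dplyr)

# Hypothetical helper: the same cleaning steps as above, parameterised by
# indicator code and output column name.
prepare_indicator <- function(raw, code, new_name,
                              regions = c("EAS", "LCN", "SSF", "HIC", "CEB", "USA"),
                              years = 1990:2022) {
  # Named vector of the form new_name = old_name, as accepted by rename(all_of())
  lookup <- c(year = "date", setNames(code, new_name))
  raw %>%
    filter(date %in% years, iso3c %in% regions) %>%
    rename(all_of(lookup)) %>%
    mutate(year = as.numeric(year)) %>%
    select(iso3c, year, all_of(new_name))
}

# Mock data with the columns the pipeline relies on (values are made up)
mock_raw <- tibble(
  iso3c = c("CEB", "USA", "FRA"),
  date = c(2020, 2020, 1985),
  SL.EMP.SELF.ZS = c(18, 6, 12)
)

prepared <- prepare_indicator(mock_raw, "SL.EMP.SELF.ZS", "self_employment")
print(prepared)
```

With such a helper, both indicators would go through identical filters, which removes the risk of the two pipelines drifting apart.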

Merge and Process Data

We merge the self-employment and GDP per worker data into a single data frame and adjust the GDP values to be in thousands.

merged_data <- merge(gdp_data, self_employment_data, by = c("iso3c", "year"))
merged_data$gdp_per_worker <- merged_data$gdp_per_worker / 1000

selected_data <- merged_data %>% select(iso3c, year, gdp_per_worker, self_employment)

head(selected_data)

  iso3c year gdp_per_worker self_employment
1   CEB 1990             NA              NA
2   CEB 1991       38.32582        23.23896
3   CEB 1992       36.71664        23.70352
4   CEB 1993       37.50670        24.21145
5   CEB 1994       38.87385        24.30136
6   CEB 1995       41.02278        24.73645
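
With its default arguments, merge() performs an inner join: only iso3c and year combinations present in both data frames are kept. The sketch below illustrates this on two toy data frames with made-up values; the names a, b, inner, and outer are illustrative only.

```r
# Toy frames with made-up values; only the key columns mirror the tutorial
a <- data.frame(iso3c = c("CEB", "CEB", "USA"),
                year = c(1990, 1991, 1991),
                gdp_per_worker = c(NA, 38, 100))
b <- data.frame(iso3c = c("CEB", "USA", "USA"),
                year = c(1991, 1991, 1992),
                self_employment = c(23, 8, 8))

# Default: keep only keys found in both frames
inner <- merge(a, b, by = c("iso3c", "year"))

# all = TRUE: keep every key from either frame, filling gaps with NA
outer <- merge(a, b, by = c("iso3c", "year"), all = TRUE)

nrow(inner)  # 2
nrow(outer)  # 4
```

If a region or year were missing from one of the two indicators, the default join would drop it silently, so comparing row counts before and after the merge is a cheap sanity check.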

Adjust GDP Values Relative to the USA

We pivot the GDP data to a wide format, adjust the values relative to the USA, and clean up the column names.

adjust_values <- function(df) {
  df %>%
    rowwise() %>%
    mutate(across(.cols = -c(year, USA), .fns = ~ . / USA, 
                  .names = "relative_{.col}")) %>%
    select(year, starts_with("relative_"))
}

rowwise(): This function makes subsequent operations run one row at a time. In the wide data each row is a year, with one column per region, so the division by the USA value is carried out year by year. Because division in R is already vectorized, the result would be the same without rowwise(). Note also that the output stays a rowwise data frame (see the "# Rowwise:" line in the printed output) until ungroup() is called.
mutate(across(.cols = -c(year, USA), .fns = ~ . / USA, .names = "relative_{.col}")): This line performs the main adjustment operation.
- mutate(): This function is used to create new columns or modify existing ones.
- across(): This helper function specifies which columns to apply the function to.
- .cols = -c(year, USA): This specifies that all columns except year and USA should be included in the operation.
- .fns = ~ . / USA: This specifies the function to be applied, which in this case is dividing each value by the corresponding value in the USA column. The tilde (~) is shorthand for creating an anonymous function in R.
- .names = "relative_{.col}": This specifies the naming convention for the new columns. Each new column name will be prefixed with “relative_” followed by the original column name, indicating that the values are relative to the USA.
select(year, starts_with("relative_")): This retains only the year column and any columns whose names start with “relative_”. This drops the original columns, including USA, leaving only the relative values and the year.

Example Workflow

Original Data Frame:
- Contains one row per year and one column per region holding its GDP per worker, including a USA column.
Step-by-Step Adjustment:
- rowwise(): Ensures the function operates row by row.
- mutate(across(.cols = -c(year, USA), .fns = ~ . / USA, .names = "relative_{.col}")):
  - For each row (year), divide the GDP per worker of each region by the GDP per worker of the USA in that same row.
  - Create new columns with names prefixed by “relative_” to indicate the adjustment.
- select(year, starts_with("relative_")):
  - Keep only the year column and the newly created “relative_” columns, simplifying the data frame to the adjusted values.

This function is useful for comparing GDP per worker across countries relative to the USA, providing a standardized way to understand relative productivity levels.
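
Because division in R is vectorized, the same ratios can be computed without rowwise(). The following sketch shows a variant on toy data; adjust_values_vec and gdp_wide_toy are hypothetical names, and the numbers are made up so the ratios are easy to check by hand.

```r
library(dplyr)

# Made-up wide data: one row per year, one column per region
gdp_wide_toy <- tibble(
  year = c(1991, 1992),
  CEB = c(40, 45),
  SSF = c(10, 9),
  USA = c(100, 90)
)

# Same column selection and naming as adjust_values(), without rowwise()
adjust_values_vec <- function(df) {
  df %>%
    mutate(across(.cols = -c(year, USA), .fns = ~ .x / USA,
                  .names = "relative_{.col}")) %>%
    select(year, starts_with("relative_"))
}

relative_toy <- adjust_values_vec(gdp_wide_toy)
print(relative_toy)
# relative_CEB is 40/100 and 45/90; relative_SSF is 10/100 and 9/90
```

The result is an ordinary tibble rather than a rowwise one, so later steps do not need an ungroup() call.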

Applying the Function

gdp_wide <- selected_data %>%
  select(iso3c, year, gdp_per_worker) %>%
  pivot_wider(names_from = iso3c, values_from = gdp_per_worker)

adjusted_gdp <- adjust_values(gdp_wide)

clean_gdp <- adjusted_gdp %>%
  rename_with(.fn = ~ gsub("relative_", "", .x), .cols = starts_with("relative_"))

head(clean_gdp)

# A tibble: 6 × 5
# Rowwise: 
   year    CEB    EAS    LCN     SSF
  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1  1990 NA     NA     NA     NA     
2  1991  0.416  0.128  0.371  0.0946
3  1992  0.388  0.130  0.362  0.0892
4  1993  0.392  0.134  0.363  0.0850
5  1994  0.401  0.139  0.369  0.0821
6  1995  0.421  0.145  0.364  0.0820

Pivot Data Back to Long Format

We pivot the adjusted GDP data back to a long format and merge it with the self-employment data.

gdp_long <- clean_gdp %>%
  pivot_longer(cols = -year, names_to = "iso3c")

merged_long <- merge(gdp_long, selected_data %>% select(iso3c, year, self_employment), 
                     by = c("year", "iso3c"))
merged_long <- merged_long %>% rename(gdp_per_worker = value)

head(merged_long)

  year iso3c gdp_per_worker self_employment
1 1990   CEB             NA              NA
2 1990   EAS             NA              NA
3 1990   LCN             NA              NA
4 1990   SSF             NA              NA
5 1991   CEB      0.4155623        23.23896
6 1991   EAS      0.1282531        64.77200

Filter Data for Specific Regions and Years

We keep observations from 1997 onward, the period shown in the plot. This also drops the 1990 rows, in which both series are missing.

filtered_data <- merged_long %>% 
  filter(year > 1996)

head(filtered_data)

  year iso3c gdp_per_worker self_employment
1 1997   CEB     0.42312325        25.06862
2 1997   EAS     0.15091328        60.71787
3 1997   LCN     0.36562360        39.96472
4 1997   SSF     0.08194211        80.27294
5 1998   CEB     0.42089751        25.09654
6 1998   EAS     0.14543160        60.42258

Define Region Names and Custom Colors

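We define long names for the region codes and custom colors for the plot. The names blue, red, and green are unquoted, so they must already exist as objects; here they are presumably defined in the sourced theme_and_colors_IMF.R file.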

iso3c_long_names <- c(CEB = "Central Europe and Baltics", 
                      EAS = "East Asia and Pacific", 
                      LCN = "Latin America and Caribbean", 
                      SSF = "Sub-Saharan Africa", 
                      HIC = "High income countries")
custom_colors <- c(CEB = "grey", EAS = blue, LCN = red, SSF = green, HIC = "black")
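
If the sourced color objects are not available, the vector can be built from explicit color strings. The sketch below uses hypothetical hex values as stand-ins (they are assumptions, not the colors from the project file) and checks that every plotted region code has an entry in both vectors.

```r
# Hypothetical stand-ins for the sourced colour objects; hex values are assumptions
blue <- "#1F77B4"
red <- "#D62728"
green <- "#2CA02C"

custom_colors <- c(CEB = "grey", EAS = blue, LCN = red, SSF = green, HIC = "black")
iso3c_long_names <- c(CEB = "Central Europe and Baltics",
                      EAS = "East Asia and Pacific",
                      LCN = "Latin America and Caribbean",
                      SSF = "Sub-Saharan Africa",
                      HIC = "High income countries")

# Named `values` in scale_color_manual() are matched to the data by name,
# so each code that appears in the plot needs a matching name here
plotted_codes <- c("CEB", "EAS", "LCN", "SSF")
missing_colors <- setdiff(plotted_codes, names(custom_colors))
missing_labels <- setdiff(plotted_codes, names(iso3c_long_names))

length(missing_colors)  # 0
length(missing_labels)  # 0
```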

Create Plot

We create a scatter plot of the self-employment share against relative GDP per worker, with a logarithmic y-axis and year labels for 1997, 2010, and 2022.

label_data <- filtered_data %>%
  filter(year %in% c(1997, 2010, 2022))

plot <- ggplot(filtered_data, aes(x = self_employment, y = gdp_per_worker, col = iso3c)) +
  geom_point() +
  scale_y_log10() +
  theme_imf() +
  xlab("Share of self-employment in total employment") +
  geom_text_repel(data = label_data, aes(x = self_employment, y = gdp_per_worker, 
                                         label = year, col = iso3c), size = 3.5, 
                  show.legend = FALSE, box.padding = unit(0.15, "lines"), 
                  point.padding = unit(0.5, "lines"), 
                  segment.color = 'grey50') +
  ylab("Ratio of labor productivity to that of US") +
  ggtitle("Self-employment and labor productivity, 1997-2022") +
  theme_imf_panel() +
  scale_color_manual(values = custom_colors, labels = iso3c_long_names) +
  theme(legend.position = c(0.7, 0.8)) +
  theme(legend.title = element_blank())

print(plot)
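
The plot above depends on theme functions from the sourced project file. For readers without that file, here is a reduced sketch of the same plot structure that runs on its own. It uses theme_minimal() as a stand-in theme (an assumption, not the IMF theme) and only the 1997 values, rounded from the filtered_data output shown earlier.

```r
library(ggplot2)
library(ggrepel)

# 1997 values copied (rounded) from the filtered_data output
toy <- data.frame(
  iso3c = c("CEB", "EAS", "LCN", "SSF"),
  year = 1997,
  self_employment = c(25.07, 60.72, 39.96, 80.27),
  gdp_per_worker = c(0.423, 0.151, 0.366, 0.082)
)

toy_plot <- ggplot(toy, aes(x = self_employment, y = gdp_per_worker, col = iso3c)) +
  geom_point() +
  # Year labels that are pushed away from the points to avoid overlap
  geom_text_repel(aes(label = year), size = 3.5, show.legend = FALSE) +
  scale_y_log10() +
  theme_minimal() +  # stand-in for the project theme
  xlab("Share of self-employment in total employment") +
  ylab("Ratio of labor productivity to that of US") +
  theme(legend.title = element_blank())

print(toy_plot)
```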

Save Plot

Finally, we save the plot as a high-resolution PNG file.

ggsave(plot, filename = here("figures/fund-7--self-empl-lab-prod-reg-90-22.png"), 
       dpi = 600, width = 8.5 * 1.2, height = 5.5 * 1.2)
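
ggsave() interprets width and height in inches by default, so the pixel size of the saved file follows from the dimensions and the dpi. A short worked calculation, using only the numbers from the call above:

```r
# Dimensions passed to ggsave(), in inches
width_in <- 8.5 * 1.2
height_in <- 5.5 * 1.2
dpi <- 600

# Pixel dimensions of the resulting PNG
width_px <- round(width_in * dpi)
height_px <- round(height_in * dpi)

c(width_px = width_px, height_px = height_px)  # 6120 by 3960 pixels
```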

Conclusion

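In this tutorial, we used World Bank data to compare self-employment shares with labor productivity, measured relative to the USA, for four regional aggregates. In the processed data, regions with higher self-employment shares have lower relative productivity. In 1997, for example, Sub-Saharan Africa had a self-employment share of about 80 percent and productivity of about 8 percent of the US level, while Central Europe and the Baltics had a share of about 25 percent and productivity of about 42 percent of the US level.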