library(wbstats)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggrepel)
library(here)
source(here("utils/theme_and_colors_IMF.R"))
Self-employment and GDP per worker
Introduction
In this tutorial, we will analyze the relationship between self-employment and labor productivity (measured as GDP per worker) using data from the World Bank. We will process the data, create visualizations, and interpret the results.
Load Necessary Libraries
First, we need to load the necessary libraries. These include wbstats for accessing World Bank data, dplyr and tidyr for data manipulation, and ggplot2 and ggrepel for data visualization.
Load and Process Data
Self-Employment Data
We start by loading the self-employment data using the wb_data function from the wbstats package. We filter the data to include years from 1990 to 2022 and rename columns for clarity.
self_employment <- wb_data(indicator = "SL.EMP.SELF.ZS", country = "all")
self_employment_data <- self_employment %>%
  filter(date %in% 1990:2022) %>%
  rename(year = date, self_employment = SL.EMP.SELF.ZS) %>%
  filter(iso3c %in% c("EAS", "LCN", "SSF", "HIC", "CEB", "USA")) %>%
  select(iso3c, year, self_employment)
self_employment_data$year <- as.numeric(self_employment_data$year)
head(self_employment_data)
# A tibble: 6 × 3
iso3c year self_employment
<chr> <dbl> <dbl>
1 CEB 2022 17.6
2 CEB 2021 17.7
3 CEB 2020 18.0
4 CEB 2019 17.8
5 CEB 2018 18.1
6 CEB 2017 18.4
GDP Per Worker Data
Next, we load the GDP per worker (labor productivity) data, filter it for the same years, and rename the columns.
gdp_per_worker <- wb_data(indicator = "SL.GDP.PCAP.EM.KD", country = "all")
gdp_data <- gdp_per_worker %>%
  filter(date %in% 1990:2022) %>%
  rename(gdp_per_worker = SL.GDP.PCAP.EM.KD, year = date) %>%
  filter(iso3c %in% c("EAS", "LCN", "SSF", "HIC", "CEB", "USA")) %>%
  select(iso3c, year, gdp_per_worker)
gdp_data$year <- as.numeric(gdp_data$year)
head(gdp_data)
# A tibble: 6 × 3
iso3c year gdp_per_worker
<chr> <dbl> <dbl>
1 CEB 1990 NA
2 CEB 1991 38321.
3 CEB 1992 36712.
4 CEB 1993 37502.
5 CEB 1994 38869.
6 CEB 1995 41017.
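The 1990 value in the output above is NA, so it can help to count missing observations per region before merging. A minimal sketch on invented numbers follows; toy_gdp is a hypothetical stand-in for gdp_data, not World Bank data.

```r
library(dplyr)

# Hypothetical stand-in for gdp_data; the values are invented for illustration.
toy_gdp <- data.frame(
  iso3c = c("CEB", "CEB", "CEB", "USA", "USA", "USA"),
  year = c(1990, 1991, 1992, 1990, 1991, 1992),
  gdp_per_worker = c(NA, 38000, 36500, 90000, 91000, NA)
)

# Count total and missing observations for each region code.
missing_summary <- toy_gdp %>%
  group_by(iso3c) %>%
  summarise(
    n_obs = n(),
    n_missing = sum(is.na(gdp_per_worker)),
    .groups = "drop"
  )

print(missing_summary)
```

Applied to the real gdp_data, the same summary would show which regions have gaps and how many.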
Merge and Process Data
We merge the self-employment and GDP per worker data into a single data frame and adjust the GDP values to be in thousands.
merged_data <- merge(gdp_data, self_employment_data, by = c("iso3c", "year"))
merged_data$gdp_per_worker <- merged_data$gdp_per_worker / 1000
selected_data <- merged_data %>% select(iso3c, year, gdp_per_worker, self_employment)
head(selected_data)
iso3c year gdp_per_worker self_employment
1 CEB 1990 NA NA
2 CEB 1991 38.32069 23.23896
3 CEB 1992 36.71158 23.70352
4 CEB 1993 37.50158 24.21145
5 CEB 1994 38.86858 24.30136
6 CEB 1995 41.01747 24.73645
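Base R's merge() performs an inner join by default, so rows that appear in only one of the two tables are dropped. The sketch below illustrates this on invented data and shows a dplyr equivalent; the tables and the code "ZZZ" are hypothetical.

```r
library(dplyr)

# Invented toy tables; "ZZZ" is a hypothetical code present on one side only.
toy_gdp <- data.frame(
  iso3c = c("CEB", "USA", "ZZZ"),
  year = 2000,
  gdp_per_worker = c(40000, 95000, 10000)
)
toy_self <- data.frame(
  iso3c = c("CEB", "USA"),
  year = 2000,
  self_employment = c(25, 7)
)

# merge() keeps only the keys found in both tables (inner join).
merged_base <- merge(toy_gdp, toy_self, by = c("iso3c", "year"))

# The same join with dplyr, followed by the rescaling to thousands.
merged_dplyr <- inner_join(toy_gdp, toy_self, by = c("iso3c", "year")) %>%
  mutate(gdp_per_worker = gdp_per_worker / 1000)

print(merged_dplyr)
```

If unmatched rows should be kept instead, merge(..., all = TRUE) or full_join() would be the alternatives.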
Adjust GDP Values Relative to the USA
We pivot the GDP data to a wide format, adjust the values relative to the USA, and clean up the column names.
The adjust_values function performs the adjustment; each of its steps is explained after the code.
adjust_values <- function(df) {
  df %>%
    rowwise() %>%
    mutate(across(.cols = -c(year, USA), .fns = ~ . / USA,
                  .names = "relative_{.col}")) %>%
    select(year, starts_with("relative_"))
}
- rowwise(): Indicates that operations should be performed row by row. The data frame is in wide format at this point, so each row holds one year, and each region's value is divided by the USA value in that same row. Because division in R is already vectorized, the result should be the same without rowwise().
- mutate(across(.cols = -c(year, USA), .fns = ~ . / USA, .names = "relative_{.col}")): Performs the main adjustment.
  - mutate(): Creates new columns or modifies existing ones.
  - across(): Specifies which columns the function is applied to.
  - .cols = -c(year, USA): Selects all columns except year and USA.
  - .fns = ~ . / USA: The function to apply, here dividing each value by the corresponding value in the USA column. The tilde (~) is a shorthand that dplyr interprets as an anonymous function.
  - .names = "relative_{.col}": Sets the naming convention for the new columns. Each new column name is "relative_" followed by the original column name, indicating that the values are relative to the USA.
- select(year, starts_with("relative_")): Keeps only the year column and the columns whose names start with "relative_". This drops the original columns, leaving only the year and the relative values.
Example Workflow
- Original Data Frame:
  - Contains columns for different countries’ GDP per worker, including the USA, for multiple years.
- Step-by-Step Adjustment:
  - rowwise(): Ensures the function operates row by row.
  - mutate(across(.cols = -c(year, USA), .fns = ~ . / USA, .names = "relative_{.col}")):
    - For each row, divide the GDP per worker of each country by the GDP per worker of the USA for that same row.
    - Create new columns with names prefixed by “relative_” to indicate the adjustment.
  - select(year, starts_with("relative_")): Keep only the year column and the newly created “relative_” columns, simplifying the data frame to the adjusted values.
This function is useful for comparing GDP per worker across countries relative to the USA, providing a standardized way to understand relative productivity levels.
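The steps above can be tried on a small wide table. The sketch below copies adjust_values from this tutorial, applies it to invented numbers, and compares it with a variant that omits rowwise(); toy_wide and its values are hypothetical.

```r
library(dplyr)

# Invented wide table: one row per year, one column per region code.
toy_wide <- data.frame(
  year = c(2000, 2001),
  CEB = c(40, 42),
  SSF = c(8, 9),
  USA = c(100, 105)
)

# Function as defined in the tutorial.
adjust_values <- function(df) {
  df %>%
    rowwise() %>%
    mutate(across(.cols = -c(year, USA), .fns = ~ . / USA,
                  .names = "relative_{.col}")) %>%
    select(year, starts_with("relative_"))
}

relative_rowwise <- adjust_values(toy_wide)

# Variant without rowwise(): column-wise division is already vectorized.
relative_vectorized <- toy_wide %>%
  mutate(across(.cols = -c(year, USA), .fns = ~ .x / USA,
                .names = "relative_{.col}")) %>%
  select(year, starts_with("relative_"))

print(relative_vectorized)
```

On this toy input both versions give the same ratios; the vectorized one returns an ordinary data frame rather than a rowwise one.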
Applying the Function
gdp_wide <- selected_data %>%
  select(iso3c, year, gdp_per_worker) %>%
  pivot_wider(names_from = iso3c, values_from = gdp_per_worker)
adjusted_gdp <- adjust_values(gdp_wide)
clean_gdp <- adjusted_gdp %>%
  rename_with(.fn = ~ gsub("relative_", "", .x), .cols = starts_with("relative_"))
head(clean_gdp)
# A tibble: 6 × 5
# Rowwise:
year CEB EAS LCN SSF
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1990 NA NA NA NA
2 1991 0.416 0.128 0.371 0.0943
3 1992 0.388 0.129 0.362 0.0889
4 1993 0.392 0.134 0.363 0.0847
5 1994 0.401 0.138 0.369 0.0819
6 1995 0.420 0.144 0.364 0.0818
Pivot Data Back to Long Format
We pivot the adjusted GDP data back to a long format and merge it with the self-employment data.
gdp_long <- clean_gdp %>%
  pivot_longer(cols = -year, names_to = "iso3c")
merged_long <- merge(gdp_long, selected_data %>% select(iso3c, year, self_employment),
                     by = c("year", "iso3c"))
merged_long <- merged_long %>% rename(gdp_per_worker = value)
head(merged_long)
year iso3c gdp_per_worker self_employment
1 1990 CEB NA NA
2 1990 EAS NA NA
3 1990 LCN NA NA
4 1990 SSF NA NA
5 1991 CEB 0.4155066 23.23896
6 1991 EAS 0.1279286 64.77200
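The round trip between long and wide formats can be sketched on toy data. This variant names the value column directly through values_to, which would make a separate rename() step unnecessary; the data are invented and toy_long is a hypothetical name.

```r
library(dplyr)
library(tidyr)

# Invented long table: one row per region code and year.
toy_long <- data.frame(
  iso3c = rep(c("CEB", "USA"), each = 2),
  year = rep(c(2000, 2001), times = 2),
  gdp_per_worker = c(40, 42, 100, 105)
)

# Long to wide: one column per region code.
toy_wide <- toy_long %>%
  pivot_wider(names_from = iso3c, values_from = gdp_per_worker)

# Wide back to long, naming the value column in the same call.
toy_back <- toy_wide %>%
  pivot_longer(cols = -year, names_to = "iso3c", values_to = "gdp_per_worker")

print(toy_back)
```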
Filter Data for Specific Regions and Years
We keep only the observations from 1997 onward for our analysis.
filtered_data <- merged_long %>%
  filter(year > 1996)
head(filtered_data)
year iso3c gdp_per_worker self_employment
1 1997 CEB 0.42313376 25.06862
2 1997 EAS 0.15024726 60.71787
3 1997 LCN 0.36560596 39.96472
4 1997 SSF 0.08171297 80.27294
5 1998 CEB 0.42093063 25.09654
6 1998 EAS 0.14471016 60.42258
Define Region Names and Custom Colors
We define long names for the region codes and custom colors for the plot.
iso3c_long_names <- c(CEB = "Central Europe and Baltics",
                      EAS = "East Asia and Pacific",
                      LCN = "Latin America and Caribbean",
                      SSF = "Sub-Saharan Africa",
                      HIC = "High income countries")
# blue, red, and green are unquoted objects, presumably colors defined in utils/theme_and_colors_IMF.R
custom_colors <- c(CEB = "grey", EAS = blue, LCN = red, SSF = green, HIC = "black")
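Both vectors are keyed by region code, so a quick check that every code in the data has an entry can catch typos before plotting. The sketch below uses quoted base R color names as placeholders for the theme's color objects, and codes_in_data is a hypothetical stand-in for unique(filtered_data$iso3c).

```r
# Hypothetical set of codes present in the data to be plotted.
codes_in_data <- c("CEB", "EAS", "LCN", "SSF")

# Placeholder colors: quoted base R names stand in for the theme's objects.
toy_colors <- c(CEB = "grey", EAS = "blue", LCN = "red",
                SSF = "green", HIC = "black")
toy_labels <- c(CEB = "Central Europe and Baltics",
                EAS = "East Asia and Pacific",
                LCN = "Latin America and Caribbean",
                SSF = "Sub-Saharan Africa",
                HIC = "High income countries")

# Codes that lack a color or a label; both should be empty.
missing_colors <- setdiff(codes_in_data, names(toy_colors))
missing_labels <- setdiff(codes_in_data, names(toy_labels))
stopifnot(length(missing_colors) == 0, length(missing_labels) == 0)

# Look up the long names by code.
print(unname(toy_labels[codes_in_data]))
```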
Create Plot
We create a scatter plot to visualize the relationship between self-employment and GDP per worker, with labels for specific years.
label_data <- filtered_data %>%
  filter(year %in% c(1997, 2010, 2022))
plot <- ggplot(filtered_data, aes(x = self_employment, y = gdp_per_worker, col = iso3c)) +
  geom_point() +
  scale_y_log10() +
  theme_imf() +
  xlab("Share of self-employment in total employment") +
  geom_text_repel(data = label_data, aes(x = self_employment, y = gdp_per_worker,
                                         label = year, col = iso3c), size = 3.5,
                  show.legend = FALSE, box.padding = unit(0.15, "lines"),
                  point.padding = unit(0.5, "lines"),
                  segment.color = 'grey50') +
  ylab("Ratio of labor productivity to that of US") +
  ggtitle("Self-employment and labor productivity, 1997-2022") +
  theme_imf_panel() +
  scale_color_manual(values = custom_colors, labels = iso3c_long_names) +
  theme(legend.position = c(0.7, 0.8)) +
  theme(legend.title = element_blank())
print(plot)
Save Plot
Finally, we save the plot as a high-resolution PNG file.
ggsave(plot, filename = here("figures/fund-7--self-empl-lab-prod-reg-90-22.png"),
dpi = 600, width = 8.5 * 1.2, height = 5.5 * 1.2)
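Saving may fail if the target directory does not exist yet, so one option is to create it first. The sketch below uses a temporary directory and a hypothetical file name rather than the project's figures folder, and also works out the pixel size implied by the dimensions and dpi used above.

```r
# Hypothetical output location under a temporary directory, for illustration.
out_dir <- file.path(tempdir(), "figures")
out_file <- file.path(out_dir, "example-plot.png")

# Create the directory (and any parents) only if it is missing.
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Pixel dimensions implied by size in inches times dpi.
dpi <- 600
width_px <- 8.5 * 1.2 * dpi
height_px <- 5.5 * 1.2 * dpi
cat("Output size in pixels:", round(width_px), "x", round(height_px), "\n")

# The save call itself would then look like the one in the tutorial:
# ggsave(plot, filename = out_file, dpi = dpi, width = 8.5 * 1.2, height = 5.5 * 1.2)
```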
Conclusion
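In this tutorial, we analyzed the relationship between self-employment and labor productivity using World Bank data: we processed the data, expressed GDP per worker relative to the USA, and plotted it against the self-employment share by region. In the values shown above, regions with higher self-employment shares have lower relative labor productivity. In 1997, for example, Sub-Saharan Africa had a self-employment share of about 80% and labor productivity of about 8% of the US level, while Central Europe and the Baltics had a share of about 25% and productivity of about 42% of the US level.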