Scatterplots with Labels and Regression Lines
This tutorial shows how to make a scatterplot of the relationship between the level of GDP per capita in 1994 and average annual GDP per capita growth from 1994 to 2019 for countries in the Americas and Europe. We exclude a few outliers, including Bosnia and Herzegovina (BIH), Ireland (IRL), and Venezuela (VEN).
Set up Libraries
Load necessary libraries for data manipulation and plotting.
library(haven)
library(countrycode)
library(dplyr)
library(here)
library(tidyr)
We also source a script that defines a set of colors, including red and blue:
source(here("utils/theme_and_colors_IMF.R"))
Note that red and “red” are not the same. The object red is defined in the “theme_and_colors_IMF.R” file as “#C00050”, whereas the string “red” is R's built-in color name, defined as …
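To see the difference concretely, the sketch below compares the two in base R. Because the theme file is not reproduced here, the object red is recreated by hand from the hex code quoted above; that stand-in assignment is an assumption made for illustration.

```r
# Stand-in for the object defined in theme_and_colors_IMF.R (assumed here,
# using the hex code quoted in the text).
red <- "#C00050"

# col2rgb() converts either a hex string or a built-in color name
# into its red, green, and blue components (0-255).
custom_rgb  <- col2rgb(red)
builtin_rgb <- col2rgb("red")

print(custom_rgb)
print(builtin_rgb)

# The components differ, so col = red and col = "red" draw different shades.
same_color <- identical(custom_rgb, builtin_rgb)
print(same_color)
```

In the plotting code further down, col = red (no quotes) therefore picks up the custom shade, while col = "red" would fall back to R's built-in color.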
Import the Data
Read the Penn World Table data into R.
pwt <- read_dta(here("databases/pwt100.dta"))
Identify Continents and Exclude Certain Countries
Use the countrycode library to map country codes to their respective continents, and exclude a few outliers. We also select only the variables and years we will be using:
pwt <- pwt %>% rename(iso3c=countrycode) %>%
  mutate(continent=countrycode(iso3c,"iso3c", "continent")) %>%
  filter(continent %in% c("Europe","Americas")) %>%
  filter(!iso3c %in% c("BIH", "IRL", "VEN","CYM","NOR","LUX")) %>%
  filter(year %in% c(1994,2019)) %>%
  select(year,iso3c,rgdpna,pop,continent)
pwt
# A tibble: 152 × 5
year iso3c rgdpna pop continent
<dbl> <chr> <dbl> <dbl> <chr>
1 1994 ABW 2248. 0.0767 Americas
2 2019 ABW 3069. 0.106 Americas
3 1994 AIA 113. 0.00966 Americas
4 2019 AIA 223. 0.0149 Americas
5 1994 ALB 11977. 3.15 Europe
6 2019 ALB 37205. 2.88 Europe
7 1994 ARG 613237. 34.4 Americas
8 2019 ARG 975569 44.8 Americas
9 1994 ATG 811. 0.0672 Americas
10 2019 ATG 1512. 0.0971 Americas
# ℹ 142 more rows
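As a small illustration of the mapping step, the sketch below calls countrycode() on three ISO3 codes taken from the output above. It assumes the countrycode package is installed; the vector names codes and continents are made up for the example.

```r
library(countrycode)

# Three ISO3 codes that appear in the table above.
codes <- c("ABW", "ALB", "ARG")

# origin = "iso3c" states what kind of code the input is;
# destination = "continent" asks for the continent name.
continents <- countrycode(codes, origin = "iso3c", destination = "continent")

print(data.frame(iso3c = codes, continent = continents))
```

Codes that countrycode() cannot match come back as NA, and the filter on continent %in% c("Europe","Americas") then drops those rows.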
Calculate GDP Per Capita and Growth
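First, calculate GDP per capita. The Penn World Table reports rgdpna in millions of 2017 US dollars and pop in millions of people, so their ratio is in dollars per person. We want to show GDP per capita in thousands of US dollars and therefore multiply by 0.001: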
pwt2 <- pwt %>% mutate(ypop=0.001*rgdpna/pop) %>%
  select(year,iso3c,ypop,continent)
pwt2
# A tibble: 152 × 4
year iso3c ypop continent
<dbl> <chr> <dbl> <chr>
1 1994 ABW 29.3 Americas
2 2019 ABW 28.9 Americas
3 1994 AIA 11.7 Americas
4 2019 AIA 15.0 Americas
5 1994 ALB 3.81 Europe
6 2019 ALB 12.9 Europe
7 1994 ARG 17.8 Americas
8 2019 ARG 21.8 Americas
9 1994 ATG 12.1 Americas
10 2019 ATG 15.6 Americas
# ℹ 142 more rows
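The unit conversion can be checked by hand for a single row. The sketch below redoes the calculation for Argentina in 1994 with the rounded values printed earlier, on the understanding that the Penn World Table reports rgdpna in millions of 2017 US dollars and pop in millions of people. The variable names are made up for the example.

```r
# Argentina in 1994, values copied from the printed table above.
rgdpna_arg <- 613237   # real GDP, millions of 2017 US dollars
pop_arg    <- 34.4     # population, millions (rounded as printed)

# Millions of dollars divided by millions of people gives dollars per person.
dollars_per_person <- rgdpna_arg / pop_arg

# Scale to thousands of dollars, as in the mutate() call.
ypop_arg <- 0.001 * dollars_per_person
print(round(ypop_arg, 1))
```

The result agrees with the ypop value shown for ARG in 1994 in the output above.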
Next, we use pivot_wider to create a column for 1994 and a column for 2019:
pwt3 <- pwt2 %>% pivot_wider(names_from = year,values_from = ypop) %>%
  rename(ypop_1994=`1994`, ypop_2019=`2019`)
pwt3
# A tibble: 76 × 4
iso3c continent ypop_1994 ypop_2019
<chr> <chr> <dbl> <dbl>
1 ABW Americas 29.3 28.9
2 AIA Americas 11.7 15.0
3 ALB Europe 3.81 12.9
4 ARG Americas 17.8 21.8
5 ATG Americas 12.1 15.6
6 AUT Europe 37.8 53.2
7 BEL Europe 33.1 46.3
8 BGR Europe 9.54 21.2
9 BHS Americas 28.6 32.7
10 BLR Europe 7.21 21.2
# ℹ 66 more rows
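The reshaping step is easier to follow on a tiny table. The sketch below applies pivot_wider() to two countries with values copied from the output above. As a variation on the tutorial's code, it uses the names_prefix argument to build the ypop_1994 and ypop_2019 column names directly instead of renaming afterwards. The object names toy and wide are made up for the example.

```r
library(dplyr)
library(tidyr)

# Long format: one row per country and year.
toy <- tibble(
  year  = c(1994, 2019, 1994, 2019),
  iso3c = c("ALB", "ALB", "ARG", "ARG"),
  ypop  = c(3.81, 12.9, 17.8, 21.8)
)

# Wide format: one row per country, one column per year.
# names_prefix prepends "ypop_" to each new column name.
wide <- toy %>%
  pivot_wider(names_from = year, values_from = ypop, names_prefix = "ypop_")

print(wide)
```

Either approach gives the same columns; the rename() version spells out the new names explicitly, while names_prefix scales better when there are many years.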
Now we calculate the average annual growth rate between 1994 and 2019, in percent, as the log difference of GDP per capita divided by the 25 years elapsed:
pwt3$growth <- 100*(1/25)*log(pwt3$ypop_2019/pwt3$ypop_1994)
growth_data <- pwt3
growth_data
# A tibble: 76 × 5
iso3c continent ypop_1994 ypop_2019 growth
<chr> <chr> <dbl> <dbl> <dbl>
1 ABW Americas 29.3 28.9 -0.0616
2 AIA Americas 11.7 15.0 1.00
3 ALB Europe 3.81 12.9 4.89
4 ARG Americas 17.8 21.8 0.803
5 ATG Americas 12.1 15.6 1.02
6 AUT Europe 37.8 53.2 1.37
7 BEL Europe 33.1 46.3 1.34
8 BGR Europe 9.54 21.2 3.19
9 BHS Americas 28.6 32.7 0.533
10 BLR Europe 7.21 21.2 4.31
# ℹ 66 more rows
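The growth formula used above is the log-difference approximation to the average annual growth rate. The sketch below compares it with the exact compound rate for Albania, using the rounded values printed above. The two helper functions are hypothetical names introduced for illustration, not part of any package.

```r
# Average annual growth in percent, computed two ways.
log_growth <- function(y_start, y_end, years) {
  100 * log(y_end / y_start) / years
}
compound_growth <- function(y_start, y_end, years) {
  100 * ((y_end / y_start)^(1 / years) - 1)
}

# Albania, 1994 to 2019, using the rounded values printed above.
g_log      <- log_growth(3.81, 12.9, 25)
g_compound <- compound_growth(3.81, 12.9, 25)

print(round(c(log = g_log, compound = g_compound), 2))
```

The two measures are close at low growth rates and drift apart as growth rises, with the log version always the smaller of the two when growth is positive. Because the inputs here are rounded, the log figure can differ slightly in the last digit from the value in the table.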
We save the data for later use, when we will make the same chart in ggplot.
saveRDS(pwt3,file=here("databases/convergence.rds"))
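For readers new to the .rds format, the sketch below shows a save-and-reload round trip on a toy data frame. It writes to a temporary file rather than to the project's databases folder; the object names are made up for the example.

```r
# A small data frame standing in for the growth data.
toy <- data.frame(
  iso3c  = c("ALB", "ARG"),
  growth = c(4.89, 0.803),
  stringsAsFactors = FALSE
)

# Write to a temporary file, then read it back.
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, file = tmp)
restored <- readRDS(tmp)

# The reloaded object keeps column names and types.
print(identical(toy, restored))

# Clean up the temporary file.
unlink(tmp)
```

Unlike a CSV export, an .rds file stores a single R object with its column types intact, which is convenient when the data will be reused in another R script.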
We split the data into two groups:
americas <- growth_data[growth_data$continent == "Americas", ]
europe <- growth_data[growth_data$continent == "Europe", ]
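As an alternative to subsetting once per continent, base R's split() returns all groups in one call. The sketch below shows this on a toy data frame with values copied from the output above; the object names toy and groups are made up for the example.

```r
# Four countries taken from the growth table above.
toy <- data.frame(
  iso3c     = c("ALB", "ARG", "AUT", "BHS"),
  continent = c("Europe", "Americas", "Europe", "Americas"),
  growth    = c(4.89, 0.803, 1.37, 0.533)
)

# split() returns a named list with one data frame per continent.
groups <- split(toy, toy$continent)

print(names(groups))
print(groups$Europe)
```

With only two groups, the explicit subsetting used in the tutorial is just as clear; split() becomes more convenient when there are many groups to loop over.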
Draw the Scatterplot with Separate Regression Lines
Create a scatterplot with a separate regression line for each continent, and label each point with its country code.
par(family="Segoe UI",mar=c(5,5,2,2))
# Base plot
plot(americas$ypop_1994, americas$growth,
xlab = "GDP per Capita in 1994",
ylab = "Average annual GDP per Capita \ngrowth (1994-2019)",
pch = 20, col = red, las=1,
xlim = range(growth_data$ypop_1994,na.rm = TRUE),
ylim = range(growth_data$growth,na.rm = TRUE))
title(main = "GDP per capita in 1994 and Average annual GDP per Capita Growth 1994-2019",
font.main=2,col.main=blue)
points(europe$ypop_1994, europe$growth, pch = 20, col = blue)
# Regression lines
lm_americas <- lm(growth ~ ypop_1994, data = americas)
lm_europe <- lm(growth ~ ypop_1994, data = europe)
abline(lm_americas, col = red)
abline(lm_europe, col = blue)
# Add country labels
text(americas$ypop_1994, americas$growth, labels = americas$iso3c,
cex = 0.5, pos = 4, col = red)
text(europe$ypop_1994, europe$growth, labels = europe$iso3c,
cex = 0.5, pos = 4, col = blue)
# Legend without a box
legend("topright", pch = 20, col = c(red, blue),
legend = c("Americas", "Europe"), bty = "n")
This does not look bad. But we will show later that we can make it much prettier using ggplot.
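The slopes behind the regression lines can also be read as numbers rather than from the chart. The sketch below fits the same kind of model on only the five European rows printed in the growth table above, so the estimate is illustrative and will not match the slope for the full sample. The object names are made up for the example.

```r
# Five European countries copied from the printed growth table.
europe_sample <- data.frame(
  iso3c     = c("ALB", "AUT", "BEL", "BGR", "BLR"),
  ypop_1994 = c(3.81, 37.8, 33.1, 9.54, 7.21),
  growth    = c(4.89, 1.37, 1.34, 3.19, 4.31)
)

# Same model form as lm_europe in the tutorial.
fit <- lm(growth ~ ypop_1994, data = europe_sample)

# The slope is the change in average annual growth (percentage points)
# associated with an extra thousand dollars of 1994 GDP per capita.
slope <- coef(fit)[["ypop_1994"]]
print(slope)
```

A negative slope means that countries that started out poorer grew faster on average, which is the pattern a downward-sloping regression line shows in the chart.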