Scatterplots with Labels and Regression Lines

This tutorial shows how to make a scatterplot of the relationship between the level of GDP per capita in 1994 and average annual GDP per capita growth from 1994 to 2019 for countries in the Americas and Europe. We exclude a few outliers, including Bosnia and Herzegovina (BIH), Ireland (IRL), and Venezuela (VEN).

Set up Libraries

Load necessary libraries for data manipulation and plotting.

library(haven)
library(countrycode)
library(dplyr)
library(here)
library(tidyr)

We also source a script that pre-defines a set of colors, including red and blue:

source(here("utils/theme_and_colors_IMF.R"))

Note that red and “red” are not the same. red is predefined in the “theme_and_colors_IMF.R” file as “#C00050”. “red” in R is defined as …
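
The difference can be checked directly in base R. The sketch below converts the built-in color name to a hex string with col2rgb() and rgb(). The variable names imf_red and named_red_hex are illustrative, and imf_red simply copies the hex value quoted above instead of sourcing the theme file.

```r
# Hex value quoted above for the `red` object from the theme file (copied by hand)
imf_red <- "#C00050"

# Convert R's built-in color name "red" to a hex string
named_red_hex <- rgb(t(col2rgb("red")), maxColorValue = 255)

# Print the built-in value and check whether the two colors coincide
print(named_red_hex)
print(identical(named_red_hex, imf_red))
```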

Import the Data

Read the Penn World Table data into R.

pwt <- read_dta(here("databases/pwt100.dta"))

Identify Continents and Exclude Certain Countries

Use the countrycode package to map country codes to their respective continents, and exclude a few outliers. We also keep only the variables and years we will be using:

pwt <- pwt %>% rename(iso3c=countrycode) %>%
        mutate(continent=countrycode(iso3c,"iso3c", "continent")) %>%
        filter(continent %in% c("Europe","Americas")) %>%
        filter(!iso3c %in%  c("BIH", "IRL", "VEN","CYM","NOR","LUX")) %>%
        filter(year %in% c(1994,2019)) %>%
        select(year,iso3c,rgdpna,pop,continent)

pwt
# A tibble: 152 × 5
    year iso3c  rgdpna      pop continent
   <dbl> <chr>   <dbl>    <dbl> <chr>    
 1  1994 ABW     2248.  0.0767  Americas 
 2  2019 ABW     3069.  0.106   Americas 
 3  1994 AIA      113.  0.00966 Americas 
 4  2019 AIA      223.  0.0149  Americas 
 5  1994 ALB    11977.  3.15    Europe   
 6  2019 ALB    37205.  2.88    Europe   
 7  1994 ARG   613237. 34.4     Americas 
 8  2019 ARG   975569  44.8     Americas 
 9  1994 ATG      811.  0.0672  Americas 
10  2019 ATG     1512.  0.0971  Americas 
# ℹ 142 more rows
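
To see the mapping and the two filters in isolation, here is a minimal sketch on three country codes. The vectors codes, continents, and keep are illustrative names, and the exclusion list is shortened to a single country for brevity.

```r
library(countrycode)

# Three ISO3 codes that appear in this tutorial
codes <- c("ARG", "ALB", "IRL")

# Map each ISO3 code to its continent, as in the mutate() step above
continents <- countrycode(codes, origin = "iso3c", destination = "continent")

# Keep European or American countries that are not on the exclusion list;
# %in% binds more tightly than !, so !codes %in% ... negates the membership test
keep <- continents %in% c("Europe", "Americas") & !codes %in% c("IRL")
print(data.frame(codes, continents, keep))
```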

Calculate GDP Per Capita and Growth

First, calculate GDP per capita. We want to show GDP per capita in thousands of US dollars and therefore multiply by 0.001.

pwt2<-pwt %>% mutate(ypop=0.001*rgdpna/pop) %>%
  select(year,iso3c,ypop,continent)

pwt2
# A tibble: 152 × 4
    year iso3c  ypop continent
   <dbl> <chr> <dbl> <chr>    
 1  1994 ABW   29.3  Americas 
 2  2019 ABW   28.9  Americas 
 3  1994 AIA   11.7  Americas 
 4  2019 AIA   15.0  Americas 
 5  1994 ALB    3.81 Europe   
 6  2019 ALB   12.9  Europe   
 7  1994 ARG   17.8  Americas 
 8  2019 ARG   21.8  Americas 
 9  1994 ATG   12.1  Americas 
10  2019 ATG   15.6  Americas 
# ℹ 142 more rows
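
The unit conversion can be verified by hand for one row. This sketch assumes, as the Penn World Table 10.0 documentation describes, that rgdpna is measured in millions of US dollars and pop in millions of people. The inputs are the rounded 1994 values for Aruba printed earlier, so the result matches the table only up to rounding.

```r
# Rounded 1994 values for Aruba (ABW), copied from the printed table
rgdpna_abw <- 2248     # real GDP, assumed to be in millions of US dollars
pop_abw    <- 0.0767   # population, assumed to be in millions of people

# Millions of dollars / millions of people = dollars per person
gdp_per_capita_usd <- rgdpna_abw / pop_abw

# Multiply by 0.001 to express the result in thousands of dollars
ypop_abw <- 0.001 * gdp_per_capita_usd
print(round(ypop_abw, 1))
```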

Next, we use pivot_wider to create one column for 1994 and one column for 2019:

pwt3<-pwt2 %>% pivot_wider(names_from = year,values_from = ypop) %>%
  rename(ypop_1994=`1994`, ypop_2019=`2019`)

pwt3
# A tibble: 76 × 4
   iso3c continent ypop_1994 ypop_2019
   <chr> <chr>         <dbl>     <dbl>
 1 ABW   Americas      29.3       28.9
 2 AIA   Americas      11.7       15.0
 3 ALB   Europe         3.81      12.9
 4 ARG   Americas      17.8       21.8
 5 ATG   Americas      12.1       15.6
 6 AUT   Europe        37.8       53.2
 7 BEL   Europe        33.1       46.3
 8 BGR   Europe         9.54      21.2
 9 BHS   Americas      28.6       32.7
10 BLR   Europe         7.21      21.2
# ℹ 66 more rows
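
The reshaping step may be easier to follow on a tiny data set. The sketch below uses two countries with values copied from the output above. The data frames long and wide are illustrative names, and the names_prefix argument is used here as an alternative to the separate rename() call, not as the method used in this tutorial.

```r
library(tidyr)

# Long format: one row per country-year
long <- data.frame(
  year  = c(1994, 2019, 1994, 2019),
  iso3c = c("ALB", "ALB", "ARG", "ARG"),
  ypop  = c(3.81, 12.9, 17.8, 21.8)
)

# Wide format: one row per country, one column per year;
# names_prefix turns the year values into ypop_1994 and ypop_2019
wide <- pivot_wider(long, names_from = year, values_from = ypop,
                    names_prefix = "ypop_")
print(wide)
```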

Now we calculate the average annual growth rate between 1994 and 2019:

pwt3$growth=100*(1/25)*log(pwt3$ypop_2019/pwt3$ypop_1994)

growth_data=pwt3
growth_data
# A tibble: 76 × 5
   iso3c continent ypop_1994 ypop_2019  growth
   <chr> <chr>         <dbl>     <dbl>   <dbl>
 1 ABW   Americas      29.3       28.9 -0.0616
 2 AIA   Americas      11.7       15.0  1.00  
 3 ALB   Europe         3.81      12.9  4.89  
 4 ARG   Americas      17.8       21.8  0.803 
 5 ATG   Americas      12.1       15.6  1.02  
 6 AUT   Europe        37.8       53.2  1.37  
 7 BEL   Europe        33.1       46.3  1.34  
 8 BGR   Europe         9.54      21.2  3.19  
 9 BHS   Americas      28.6       32.7  0.533 
10 BLR   Europe         7.21      21.2  4.31  
# ℹ 66 more rows
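
The growth measure above is an average annual log difference, which is close to, but not the same as, a compound annual growth rate. The sketch below works through both for Albania using the rounded values from the table, so the log-growth figure differs slightly from the 4.89 shown above. The variable names are illustrative.

```r
# Rounded GDP per capita for Albania (ALB), copied from the table above
y_1994 <- 3.81
y_2019 <- 12.9
n_years <- 2019 - 1994   # 25 years

# Average annual log growth, as in the formula above (percent per year)
log_growth <- 100 * (1 / n_years) * log(y_2019 / y_1994)

# Compound annual growth rate, shown only for comparison (percent per year)
compound_growth <- 100 * ((y_2019 / y_1994)^(1 / n_years) - 1)

print(round(c(log = log_growth, compound = compound_growth), 2))
```

For positive growth the log measure is always somewhat below the compound rate, and the gap widens as growth gets faster.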

We save the data for later use, when we will make the same chart in ggplot.

saveRDS(pwt3,file=here("databases/convergence.rds"))

We split the data into two groups:

americas <- growth_data[growth_data$continent == "Americas", ]
europe <- growth_data[growth_data$continent == "Europe", ]

Plot the Scatter Plot with Separate Regression Lines

Create a scatter plot with regression lines for each continent and label the points with the country code.

par(family="Segoe UI",mar=c(5,5,2,2))


# Base plot
plot(americas$ypop_1994, americas$growth, 
     xlab = "GDP per Capita in 1994", 
     ylab = "Average annual GDP per Capita \ngrowth (1994-2019)",
     pch = 20, col = red, las=1,
     xlim = range(growth_data$ypop_1994, na.rm = TRUE), 
     ylim = range(growth_data$growth, na.rm = TRUE))

title(main = "GDP per Capita in 1994 and Average Annual GDP per Capita Growth, 1994-2019",
      font.main=2,col.main=blue)

points(europe$ypop_1994, europe$growth, pch = 20, col = blue)

# Regression lines
lm_americas <- lm(growth ~ ypop_1994, data = americas)
lm_europe <- lm(growth ~ ypop_1994, data = europe)

abline(lm_americas, col = red)
abline(lm_europe, col = blue)

# Add country labels
text(americas$ypop_1994, americas$growth, labels = americas$iso3c, 
     cex = 0.5, pos = 4, col = red)

text(europe$ypop_1994, europe$growth, labels = europe$iso3c, 
     cex = 0.5, pos = 4, col = blue)

# Legend without a box
legend("topright",  pch = 20, col = c(red, blue), 
       legend = c("Americas", "Europe"), bty = "n")

This does not look bad, but we will show later that we can make it much prettier using ggplot.
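
For readers who want to check the regression lines numerically, abline() draws a fitted lm object using its intercept and slope, which coef() returns. The sketch below uses hypothetical data that lie exactly on the line y = 6 - x. The names toy and fit are illustrative, and the plot is sent to a null device so that nothing is displayed or written to disk.

```r
# Hypothetical data lying exactly on the line y = 6 - x
toy <- data.frame(x = c(1, 2, 3, 4), y = c(5, 4, 3, 2))
fit <- lm(y ~ x, data = toy)

# Intercept and slope that abline(fit) uses to draw the line
print(coef(fit))

pdf(NULL)   # null device: no plot window and no output file
plot(toy$x, toy$y, pch = 20)
abline(fit)
# pos = 4 places each label to the right of its point
text(toy$x, toy$y, labels = c("A", "B", "C", "D"), pos = 4, cex = 0.5)
invisible(dev.off())
```

Applied to lm_americas and lm_europe, a negative slope would mean that countries that were richer in 1994 grew more slowly on average.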