Creating Boxplots with ggplot2 in R

Introduction

In this tutorial, we will learn how to create boxplots using the ggplot2 package in R. Boxplots are useful for visualizing the distribution of a dataset and identifying potential outliers. We will use the built-in mtcars dataset for our example.

We first need to load ggplot2:

library(ggplot2)

Understanding the Data

The mtcars dataset contains information about various car models, including their miles per gallon (mpg), number of cylinders, horsepower, and other characteristics. Let’s take a look at the first few rows of the dataset.

# View the first few rows of the mtcars dataset
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
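
Because the plots in this tutorial group cars by cylinder count, it helps to know how many cars fall into each group. The short check below is a sketch using only base R; the variable name cyl_counts is ours, chosen for illustration.

```r
# Confirm how cyl is stored and count the cars in each cylinder group
str(mtcars$cyl)

cyl_counts <- table(mtcars$cyl)
cyl_counts  # 11 four-cylinder, 7 six-cylinder, and 14 eight-cylinder cars
```

The six-cylinder group is small, so its box summarizes only seven observations.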

Creating a Basic Boxplot

We will start by creating a basic boxplot to visualize the distribution of miles per gallon (mpg) for cars with different numbers of cylinders.

# Basic boxplot
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg)) +
  geom_boxplot() +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  ggtitle("Boxplot of MPG by Number of Cylinders")
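
The call to as.factor() matters because cyl is stored as a number, and geom_boxplot() needs a discrete variable on the x-axis to draw one box per group. As an alternative sketch, the conversion can be done once in a copy of the data so it does not have to be repeated in every aes() call. The names mtcars_f, cyl_f, and p_basic are hypothetical and used only for this illustration.

```r
library(ggplot2)

# Work on a copy so the built-in dataset stays untouched
mtcars_f <- mtcars
mtcars_f$cyl_f <- factor(mtcars_f$cyl)

class(mtcars$cyl)      # "numeric"
levels(mtcars_f$cyl_f) # "4" "6" "8"

# Same basic boxplot, now using the pre-converted factor column
p_basic <- ggplot(mtcars_f, aes(x = cyl_f, y = mpg)) +
  geom_boxplot() +
  labs(
    x = "Number of Cylinders",
    y = "Miles per Gallon (mpg)",
    title = "Boxplot of MPG by Number of Cylinders"
  )
p_basic
```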

Explanation of Boxplot Components

  • Line in the middle of the box (Median): This line represents the median value of the data.
  • Box (Interquartile Range - IQR): The box represents the IQR, which is the range between the first quartile (25th percentile) and the third quartile (75th percentile).
  • Whiskers: The lines extending from the box (whiskers) reach the most extreme data points that still lie within 1.5 times the IQR of the first and third quartiles.
  • Outliers: Points beyond the whiskers are considered outliers and are plotted individually.
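
To make these components concrete, the sketch below computes them by hand for the four-cylinder cars using base R. It assumes the default quantile() method and the usual 1.5 multiplier, and the variable names are ours for illustration. Other tools (for example, boxplot.stats()) define the quartiles slightly differently, so results may not match every implementation exactly.

```r
# Summary statistics behind the 4-cylinder box, computed by hand
mpg4 <- mtcars$mpg[mtcars$cyl == 4]

q <- quantile(mpg4, c(0.25, 0.5, 0.75))  # first quartile, median, third quartile
iqr <- q[[3]] - q[[1]]

# Fences: points outside these limits are drawn as outliers
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside <- mpg4[mpg4 >= lower_fence & mpg4 <= upper_fence]
whiskers <- range(inside)
outliers <- mpg4[mpg4 < lower_fence | mpg4 > upper_fence]

box_summary <- list(quartiles = q, whiskers = whiskers, outliers = outliers)
box_summary
```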

Adding Colors to the Boxplot

We can add colors to our boxplot to differentiate between the groups more easily.

# Colored boxplot
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
  geom_boxplot() +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  ggtitle("Boxplot of MPG by Number of Cylinders") +
  scale_fill_brewer(palette = "Set3")

In this plot:

  • fill = as.factor(cyl) inside aes() is used to fill the boxes with different colors based on the number of cylinders.
  • scale_fill_brewer(palette = "Set3") applies a color palette from the RColorBrewer package.
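
If you would rather choose the colors yourself, scale_fill_manual() accepts a named vector of colors. The sketch below also sets a readable legend title, since the default title is the raw expression as.factor(cyl). The particular colors and the names cyl_colors and p_fill are arbitrary choices for illustration.

```r
library(ggplot2)

# Hand-picked fill colors, keyed by factor level
cyl_colors <- c("4" = "steelblue", "6" = "goldenrod", "8" = "tomato")

p_fill <- ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
  geom_boxplot() +
  scale_fill_manual(values = cyl_colors) +
  labs(
    x = "Number of Cylinders",
    y = "Miles per Gallon (mpg)",
    fill = "Cylinders"  # legend title shown instead of as.factor(cyl)
  )
p_fill
```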

Adding Points to the Boxplot

We can overlay the actual data points on the boxplot to see the distribution more clearly.

# Boxplot with points
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  ggtitle("Boxplot of MPG by Number of Cylinders with Data Points") +
  scale_fill_brewer(palette = "Set3")

In this plot:

  • geom_jitter(width = 0.2) adds points to the plot with some random noise to avoid overplotting.
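
One thing to watch for when overlaying points: geom_boxplot() already draws outliers, so any outlier appears twice once geom_jitter() is added. By default geom_jitter() also adds a small amount of vertical noise, which shifts the plotted mpg values slightly. The variation below is one possible way to handle both, hiding the boxplot's own outlier markers and restricting the jitter to the horizontal direction. The alpha value and the name p_points are illustrative choices.

```r
library(ggplot2)

p_points <- ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
  geom_boxplot(outlier.shape = NA) +                 # hide the boxplot's outlier markers
  geom_jitter(width = 0.2, height = 0, alpha = 0.6) + # horizontal jitter only
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  scale_fill_brewer(palette = "Set3")
p_points
```

With height = 0, every point sits at its true mpg value, and each observation is drawn exactly once.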

Adding Mean Points to the Boxplot

We can also add points to represent the mean values of each group.

# Boxplot with mean points
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red") +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  ggtitle("Boxplot of MPG by Number of Cylinders with Mean Points") +
  scale_fill_brewer(palette = "Set3")

In this plot:

  • stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red") adds points to represent the mean values. The shape, size, and color parameters customize the appearance of these points.
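
To check the red points against actual numbers, the group means can be computed directly and placed next to the medians. This is a base R sketch; the object and column names (group_means, group_medians, comparison, mpg_mean, mpg_median) are ours for illustration.

```r
# Mean and median mpg for each cylinder group
group_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
group_medians <- aggregate(mpg ~ cyl, data = mtcars, FUN = median)

# Combine into one table with columns cyl, mpg_mean, mpg_median
comparison <- merge(group_means, group_medians,
                    by = "cyl", suffixes = c("_mean", "_median"))
comparison
```

A noticeable gap between the mean and the median within a group suggests that the distribution is skewed or influenced by extreme values.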

Customizing the Boxplot

Finally, let’s customize the boxplot by coloring the jittered points by group, switching to a minimal theme, hiding the legend, and centering the title.

# Customized boxplot
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
  geom_boxplot() +
  geom_jitter(width = 0.2, aes(color = as.factor(cyl))) +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red") +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  ggtitle("Customized Boxplot of MPG by Number of Cylinders") +
  scale_fill_brewer(palette = "Set3") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5)
  )

Explanation of scale_fill and scale_color

  • scale_fill: The scale_fill_*() family of functions (for example, scale_fill_brewer()) controls the colors of the “filled” elements in a plot, such as the boxes in a boxplot, bars in a bar plot, or areas in an area plot. In our examples, scale_fill_brewer(palette = "Set3") applies a color palette to the fill color of the boxes in the boxplot.

  • scale_color: The scale_color_*() family of functions (for example, scale_color_brewer()) controls the colors of the “outlined” or “point” elements in a plot, such as points, lines, or the borders of shapes. In our customized boxplot, scale_color_brewer(palette = "Dark2") applies a color palette to the points added by geom_jitter.

By using scale_fill and scale_color, you can customize the appearance of your plots to make them more visually appealing and informative.
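
To see the difference between the two aesthetics on the same geom, the sketch below maps cyl to color instead of fill. The box outlines, whiskers, and outlier points take the palette colors, while the box interiors stay white. The name p_outline is hypothetical.

```r
library(ggplot2)

# color controls outlines and points; fill is fixed to white here
p_outline <- ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, color = as.factor(cyl))) +
  geom_boxplot(fill = "white") +
  scale_color_brewer(palette = "Dark2") +
  labs(
    x = "Number of Cylinders",
    y = "Miles per Gallon (mpg)",
    color = "Cylinders"
  )
p_outline
```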

Conclusion

In this tutorial, we have learned how to create and customize boxplots using ggplot2 in R. Boxplots are a powerful tool for visualizing the distribution of data and identifying potential outliers. By combining boxplots with other ggplot2 functionalities, you can create informative and visually appealing data visualizations.