Creating Boxplots with ggplot2 in R
Introduction
In this tutorial, we will learn how to create boxplots using the ggplot2 package in R. Boxplots are useful for visualizing the distribution of a dataset and identifying potential outliers. We will use the built-in mtcars dataset for our example.
We first need to load ggplot2:
library(ggplot2)
Understanding the Data
The mtcars dataset contains information about various car models, including their miles per gallon (mpg), number of cylinders (cyl), horsepower (hp), and other characteristics. Let’s take a look at the first few rows of the dataset.
# View the first few rows of the mtcars dataset
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
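Note that cyl is stored as a number rather than as a categorical variable, which is why the plots in this tutorial wrap it in as.factor(). A quick check using only base R also shows how many cars fall into each cylinder group (the variable name cyl_counts is just for illustration):

```r
# cyl is numeric in mtcars, so ggplot2 would treat it as continuous by default
class(mtcars$cyl)

# Number of cars in each cylinder group
cyl_counts <- table(mtcars$cyl)
cyl_counts
```

Each distinct value of cyl becomes one box in the boxplots that follow.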
Creating a Basic Boxplot
We will start by creating a basic boxplot to visualize the distribution of miles per gallon (mpg) for cars with different numbers of cylinders.
# Basic boxplot
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg)) +
  geom_boxplot() +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  ggtitle("Boxplot of MPG by Number of Cylinders")
Explanation of Boxplot Components
- Line in the middle of the box (Median): This line represents the median value of the data.
- Box (Interquartile Range - IQR): The box represents the IQR, which is the range between the first quartile (25th percentile) and the third quartile (75th percentile).
- Whiskers: The lines extending from the box (whiskers) reach to the most extreme data points that lie within 1.5 times the IQR of the first and third quartiles.
- Outliers: Points beyond the whiskers are considered outliers and are plotted individually.
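To make these components concrete, the following sketch computes them by hand for the 8-cylinder cars. It assumes the default 1.5 × IQR rule and uses base R's quantile(); the variable names are illustrative, and ggplot2's own computation may differ in small details.

```r
# mpg values for the 8-cylinder cars only
mpg8 <- mtcars$mpg[mtcars$cyl == 8]

# Quartiles, median, and interquartile range
q1  <- quantile(mpg8, 0.25, names = FALSE)
med <- median(mpg8)
q3  <- quantile(mpg8, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: values beyond these limits are flagged as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside   <- mpg8[mpg8 >= lower_fence & mpg8 <= upper_fence]
whiskers <- range(inside)
outliers <- mpg8[mpg8 < lower_fence | mpg8 > upper_fence]

c(q1 = q1, median = med, q3 = q3, iqr = iqr)
whiskers
outliers
```

Comparing the printed values with the 8-cylinder box in the plot is a simple way to check how each element is read.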
Adding Colors to the Boxplot
We can add colors to our boxplot to differentiate between the groups more easily.
# Colored boxplot
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
  geom_boxplot() +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  ggtitle("Boxplot of MPG by Number of Cylinders") +
  scale_fill_brewer(palette = "Set3")
In this plot:
- fill = as.factor(cyl) inside aes() fills the boxes with different colors based on the number of cylinders.
- scale_fill_brewer(palette = "Set3") applies a color palette from the RColorBrewer package.
Adding Points to the Boxplot
We can overlay the actual data points on the boxplot to see the distribution more clearly.
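# Boxplot with points
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
  # Hide the boxplot's own outlier points; the jitter layer already draws every observation
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, height = 0) +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  ggtitle("Boxplot of MPG by Number of Cylinders with Data Points") +
  scale_fill_brewer(palette = "Set3")
In this plot:
- geom_jitter(width = 0.2, height = 0) adds the points with some random horizontal noise to avoid overplotting; height = 0 keeps the mpg values exact.
- outlier.shape = NA stops geom_boxplot() from drawing the outliers, which would otherwise appear twice.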
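Because the jitter is random, the points land in slightly different places each time the plot is drawn. A minimal sketch of one way to make the plot reproducible, assuming a fixed seed is acceptable (the seed value 42 and the name p_jitter are arbitrary choices for illustration):

```r
library(ggplot2)

# position_jitter() accepts a seed, so the random offsets are the same on every run
p_jitter <- ggplot(mtcars, aes(x = as.factor(cyl), y = mpg)) +
  geom_boxplot(outlier.shape = NA) +  # avoid drawing outliers twice
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 42))

p_jitter
```

Calling set.seed() before building the plot is another common option; passing the seed to position_jitter() keeps the setting local to this one layer.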
Adding Mean Points to the Boxplot
We can also add points to represent the mean values of each group.
# Boxplot with mean points
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red") +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  ggtitle("Boxplot of MPG by Number of Cylinders with Mean Points") +
  scale_fill_brewer(palette = "Set3")
In this plot:
- stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red") adds a point at the mean of each group. The shape, size, and color parameters customize the appearance of these points.
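As a cross-check, the group means can also be computed directly with base R. This is only a sketch for verification; the name group_means is illustrative.

```r
# Mean mpg per cylinder group, computed with base R
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
round(group_means, 2)
```

These values should correspond to the vertical positions of the red points in the plot.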
Customizing the Boxplot
Finally, let’s customize the boxplot by coloring the data points by group, switching to a minimal theme, hiding the redundant legend, and centering the title.
# Customized boxplot
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
  geom_boxplot(outlier.shape = NA) +  # outliers are already drawn by geom_jitter()
  geom_jitter(aes(color = as.factor(cyl)), width = 0.2, height = 0) +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red") +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  ggtitle("Customized Boxplot of MPG by Number of Cylinders") +
  scale_fill_brewer(palette = "Set3") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  theme(
    legend.position = "none",               # the x-axis already identifies the groups
    plot.title = element_text(hjust = 0.5)  # center the title
  )
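The plots in this tutorial repeat as.factor(cyl) in every mapping. One possible alternative, sketched here under the assumption that a modified copy of the data is acceptable, is to convert the column once and set all labels in a single labs() call. The names mtcars_f and p_custom are hypothetical.

```r
library(ggplot2)

# Work on a copy so the built-in mtcars data is left unchanged
mtcars_f <- transform(mtcars, cyl = factor(cyl))

p_custom <- ggplot(mtcars_f, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_boxplot() +
  labs(
    x = "Number of Cylinders",
    y = "Miles per Gallon (mpg)",
    title = "Boxplot of MPG by Number of Cylinders"
  ) +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))

p_custom
```

Storing the plot in a variable also makes it easy to add further layers later without retyping the whole call.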
Explanation of scale_fill and scale_color
- scale_fill: The scale_fill_*() family of functions modifies the colors of the “filled” elements in a plot, such as the boxes in a boxplot, bars in a bar plot, or areas in an area plot. In our examples, scale_fill_brewer(palette = "Set3") applies a color palette to the fill color of the boxes in the boxplot.
- scale_color: The scale_color_*() family of functions modifies the colors of the “outlined” or “point” elements in a plot, such as points, lines, or the borders of shapes. In our customized boxplot, scale_color_brewer(palette = "Dark2") applies a color palette to the points added by geom_jitter.
By using scale_fill and scale_color, you can customize the appearance of your plots to make them more visually appealing and informative.
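The difference between the two aesthetics is easiest to see when both are mapped on the same layer. The sketch below uses scale_fill_manual() and scale_color_manual() instead of the Brewer palettes; the hex colors and variable names are hypothetical choices for illustration.

```r
library(ggplot2)

# Hypothetical colors, one per cylinder group (names match the factor levels)
fill_colors <- c("4" = "#a6cee3", "6" = "#b2df8a", "8" = "#fdbf6f")
line_colors <- c("4" = "#1f78b4", "6" = "#33a02c", "8" = "#ff7f00")

p_scales <- ggplot(mtcars, aes(x = as.factor(cyl), y = mpg,
                               fill = as.factor(cyl), color = as.factor(cyl))) +
  geom_boxplot() +
  scale_fill_manual(values = fill_colors) +   # box interiors
  scale_color_manual(values = line_colors) +  # box outlines, whiskers, outliers
  theme(legend.position = "none")

p_scales
```

Here the fill scale controls only the inside of each box, while the color scale controls its lines and outlier points.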
Conclusion
In this tutorial, we have learned how to create and customize boxplots using ggplot2 in R. Boxplots are a powerful tool for visualizing the distribution of data and identifying potential outliers. By combining boxplots with other ggplot2 functionalities, you can create informative and visually appealing data visualizations.