4 Distributions
library(tidyverse)
library(dcldata)
4.1 Introduction
Take a look at the following cheat sheet sections before reading this chapter.
- Geoms:
geom_histogram()
geom_freqpoly()
geom_density()
geom_boxplot()
geom_violin()
geom_vline()
geom_hline()
A common first step when carrying out exploratory data analysis is examining distributions of continuous variables. For example, in this chapter, we’ll explore the ages of Congress members. To get a sense of what distributions can show, here are some possible hypotheses about the ages of Congress members (some more likely than others).
- There is no age bracket more populous than any another.
- The average age is around 60.
- There are a lot of Congress members who are around 30, and a lot who around 70.
Before we dive into the Congress data, however, we’ll introduce some mechanics. In the following sections, you’ll learn about layering geoms and the basics of theme()
.
4.2 Mechanics
4.2.1 Layering
The following plot shows the relationship of class
to hwy
using boxplots.
%>%
mpg mutate(class = fct_reorder(class, hwy)) %>%
ggplot(aes(class, hwy)) +
geom_boxplot()
We’ll explain the specifics of boxplots later on in this chapter.
The black line in the middle of each box represents the median value for that class. It would also be helpful to see the median hwy
value for all car classes. To display this on our plot, we’ll add a reference line using geom_hline()
. geom_hline()
creates horizontal lines. geom_vline()
, which creates vertical lines, and geom_abline()
, which creates lines with a given slope and intercept, are also useful for creating reference lines.
In geom_hline()
, the yintercept
argument controls the position of the horizontal line. In this case, we want a horizontal line indicating the median hwy
value, so we’ll set yintercept
equal to median(hwy, na.rm = TRUE)
.
%>%
mpg mutate(class = fct_reorder(class, hwy)) %>%
ggplot(aes(class, hwy)) +
geom_boxplot() +
geom_hline(aes(yintercept = median(hwy, na.rm = TRUE)), color = "red")
Notice that the horizontal line is on top of the boxplots. We added geom_hline()
our plot after geom_boxplot()
, so the boxplots were drawn first, and then the line drawn on top. If we switch the order, the boxplots will lie on top of the line.
%>%
mpg mutate(class = fct_reorder(class, hwy)) %>%
ggplot(aes(class, hwy)) +
geom_hline(aes(yintercept = median(hwy, na.rm = TRUE)), color = "red") +
geom_boxplot()
Placing the line on top of the boxplots can obscure information, so it’s generally better to place the line underneath.
You can add as many layers to a gpplot2 plot as you’d like. The different layers also don’t have to use the same data.
# Source: https://en.wikipedia.org/wiki/Corporate_average_fuel_economy
<-
cafe_2008 tribble(
~class, ~mpg,
"Passenger", 27.5,
"Truck", 22.4
)
%>%
mpg mutate(class = fct_reorder(class, hwy)) %>%
ggplot() +
geom_hline(aes(yintercept = mpg, color = class), data = cafe_2008) +
geom_boxplot(aes(class, hwy)) +
labs(color = "2008 CAFE\nstandard")
4.2.2 theme()
Just like every ggplot2 plot has a coordinate system, every ggplot2 plot also has a theme. The theme controls the parts of the plot unrelated to the data, like the plot background color, tick mark length, and the color, size, and angle of the labeling text.
You can edit your plot’s theme with the theme()
function. Themes have many components, and so theme()
takes many arguments. Take a look at tidyverse page on themes to get a sense of what’s possible to change.
Two theme()
arguments you’ll commonly use are axis.text.x
and axis.text.y
. These arguments control the appearance of the axis tick mark labels. Next, we’ll show you how to angle the x-axis tick mark labels to reduce overlap.
The x-axis labels overlap on the following visualization.
%>%
mpg mutate(manufacturer = fct_reorder(manufacturer, hwy)) %>%
ggplot(aes(manufacturer, hwy)) +
geom_boxplot()
In the Discrete-continuous relationships chapter, you saw how we can flip a plot and put the labels on the y-axis. However, for reasons you’ll learn later on in this chapter, in this case it’s better to keep the boxplots oriented vertically.
Instead of putting the labels on the y-axis, we’ll use theme()
’s axis.text.x
argument to angle the labels.
%>%
mpg mutate(manufacturer = fct_reorder(manufacturer, hwy)) %>%
ggplot(aes(manufacturer, hwy)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = -45))
Take a look at the code. Notice that we had to use the function element_text()
. element_text()
is an element function. You’ll need element functions to change most of theme()
’s arguments. If you look at themes()
’s documentation, you’ll see that the arguments are arranged into groups. At the beginning of each group, there’s a reference to the appropriate element function. axis.text.x
controls text, and uses element_text()
, but axis.line
controls the appearance of a line, and so uses element_line()
. Like theme()
, element_text()
has many arguments. To change the angle of the text, we only needed one: angle
.
Angling makes the axis text easier to read, but the text is now far away from the x-axis. The plot would be better if the labels were closer to the boxplots they represent. We can adjust the text justification with element_text()
’s hjust
argument.
%>%
mpg mutate(manufacturer = fct_reorder(manufacturer, hwy)) %>%
ggplot(aes(manufacturer, hwy)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
Setting hjust
to 0 aligns text to the left, which causes our labels to move closer to x-axis. hjust = 1
aligns text right.
You’ll learn about the other element functions, and other useful theme()
changes, later on.
4.3 One distribution
In this section, we’ll use data on Congress members serving in 2019.
congress#> # A tibble: 535 x 7
#> name state division chamber party age gender
#> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 Don Young AK Pacific House Republican 85.6 M
#> 2 Lisa Murkowski AK Pacific Senate Republican 61.6 F
#> 3 Dan Sullivan AK Pacific Senate Republican 54.2 M
#> 4 Terri Sewell AL East South Central House Democrat 54.0 F
#> 5 Robert Aderholt AL East South Central House Republican 53.5 M
#> 6 Mo Brooks AL East South Central House Republican 64.7 M
#> # … with 529 more rows
One of our hypotheses from the introduction was the average Congress members is around 60. “Average” is, however, ambiguous. Do we mean the median or the mean?
For some data, the mean and median are close together, but that won’t always be the case. Means are sensitive to outliers, while medians are more robust. As an example, here’s a vector x
.
<- c(1, 2, 10, 15) x
The mean is the result of summing the elements of x
and dividing by the length of x
.
mean(x)
#> [1] 7
The median represents the midpoint of x
. Half of the data is greater than 6, and half is less than 6.
median(x)
#> [1] 6
Right now, the mean and median of x
are close together, but if we add an outlier the mean explodes, but the median only increases by a small amount.
<- c(1, 2, 10, 15, 1e6)
x
mean(x)
#> [1] 2e+05
median(x)
#> [1] 10
We’ll typically use medians instead of means for averages and references lines because of this behavior.
summary()
will tell us the mean and median age in Congress.
summary(congress$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 29.2 49.6 59.4 58.6 67.4 85.6
The mean and median are close together, both between 58 and 60. Our hypothesis that the average age is 60 was pretty close.
Another interesting finding is that the House is slightly younger than the Senate.
%>%
congress group_by(chamber) %>%
summarize(median_age = median(age))
#> # A tibble: 2 x 2
#> chamber median_age
#> <chr> <dbl>
#> 1 House 58.0
#> 2 Senate 63.6
There are also some very young Congress members,
%>%
congress filter(age < 32) %>%
arrange(age)
#> # A tibble: 3 x 7
#> name state division chamber party age gender
#> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 Alexandria Ocasio-Cortez NY Middle Atlantic House Democr… 29.2 F
#> 2 Abby Finkenauer IA West North Central House Democr… 30.0 F
#> 3 Katie Hill CA Pacific House Democr… 31.4 F
and the oldest member is 85.
%>%
congress slice_max(age)
#> # A tibble: 1 x 7
#> name state division chamber party age gender
#> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 Don Young AK Pacific House Republican 85.6 M
These summary calculations can only tell us so much about the data. We can get a richer understanding of the data by visualizing the age distribution, both for Congress as a whole and for the different chambers, states, parties, etc.
In this section, we’ll focus just on visualizing the distribution of ages for all of Congress. First, let’s try a histogram.
%>%
congress ggplot(aes(age)) +
geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There’s a message above our plot telling us that geom_histogram()
is using bins = 30
, but we should choose a better value of binwidth
. Both the bins
and binwidth
arguments of geom_histogram()
control the width of the bars you see on the plot. bins
controls the number of bars. By default, geom_histogram()
creates 30 bins/bars, and then divides up the data so that each bar spans the same number of whatever units are plotted along the x-axis. Here, those units are years, and each bar represents around 2 years. The binwidth
argument of geom_histogram()
allows you to set the width of the bars directly.
If you don’t specify either bins
or binwidth
when you use geom_histogram()
, you will always get the message you see above. This message reminds you that it is always a good idea to think about your binwidth, and you will rarely want the default.
Let’s try some different binwidths.
%>%
congress ggplot(aes(age)) +
geom_histogram(binwidth = .1) +
stamp("Bad")
A binwidth of 0.1 years is too small. Notice how it’s difficult to see the overall shape of the distribution and the plot appears very “peaky.” Also, many of the bars only represent a single person.
If we swing too far in the opposite direction, however, we’ll lose information.
%>%
congress ggplot(aes(x = age)) +
geom_histogram(binwidth = 20) +
stamp("Bad")
20 years is too large of a chunk to be meaningful in this context.
In this case, the original 30 bins wasn’t too far off from ideal. We’ll make a minor adjustment so that each bin represents 5 years.
%>%
congress ggplot(aes(x = age)) +
geom_histogram(binwidth = 5, boundary = 0)
The argument boundary = 0
directs the bin boundaries to be at 0 plus multiples of the binwidth
; in this case, the bin boundaries are at 25, 30, 35, …, 90.
We can improve this histogram by adding a reference line at the median.
%>%
congress ggplot(aes(age)) +
geom_histogram(binwidth = 5, boundary = 0) +
geom_vline(aes(xintercept = median(age)), color = "red")
For simple, one-dimensional distributions, histograms are generally a good bet. They are simple to interpret and can quickly give you a sense of the data.
However, there are other ways to visualize a single distribution. geom_freqpoly()
, for example, is similar to geom_histogram()
, but instead of encoding the number of people in each age bin with a bar, it uses points along a line.
%>%
congress ggplot(aes(age)) +
geom_freqpoly(binwidth = 5) +
geom_vline(aes(xintercept = median(age)), color = "red")
You can think of frequency polygons as “histograms with lines.” Just as you have to pick a binwidth for histograms, you also have to pick one for frequency polygons.
If you’re just visualizing a single distribution, a histogram is probably better. The bars make it easier to connect counts with ages, as well as see the overall shape of the distribution. However, as you’ll see later, frequency polygons can be useful for visualizing multiple distributions at once.
geom_density()
creates a density plot and is another option for visualizing a single distribution. Density plots are essentially smoothed histograms.
%>%
congress ggplot(aes(age)) +
geom_density()
Density plots lack the fine-grained detail of frequency polygons. Sometimes, you’ll want to lose detail and focus on the overall trend, but, in other situations, density plots will hide important information. You should also avoid using density plots if you don’t have very much data. We therefore recommend using density plots only in specific scenarios.
One such scenario is visualizing simulated data.
set.seed(916)
<- tibble(x = rnorm(n = 1e4))
sim_1
%>%
sim_1 ggplot(aes(x)) +
geom_density()
We’ll introduce a second scenario in the next section.
4.4 Multiple distributions
You’ll often want to visualize more than one distribution. For example, we might want to compare the age distributions for the different political parties in congress
.
A first pass at a plot might involve using a histogram with fill
to represent the different parties.
%>%
congress ggplot(aes(age, fill = party)) +
geom_histogram(binwidth = 5) +
scale_fill_discrete(direction = -1) +
stamp("Bad")
Like stacked bar plots, stacked histograms aren’t a very good idea. Stacking makes it difficult to figure out how many Democrats are in each bin. The line
scale_fill_discrete(direction = -1)
changes the direction of the fill colors, making the Democrats blue and the Republicans red.
One option is to facet.
%>%
congress ggplot(aes(age)) +
geom_histogram(binwidth = 5) +
facet_grid(cols = vars(party)) +
stamp("Bad")
The side-by-side histograms are difficult to compare. Vertical facets are much better.
%>%
congress ggplot(aes(age)) +
geom_histogram(binwidth = 5) +
facet_grid(rows = vars(party))
Another option is overlaying frequency polygons.
%>%
congress ggplot(aes(age, color = party)) +
geom_freqpoly(binwidth = 5) +
scale_color_discrete(direction = -1)
The frequency polygons make it even easier to compare the three distributions. Democrats and Republicans are similar, but Democrats are more spread out and are slightly older. The few members in the “Other” category are older than the average Democrat or Republican.
The same pattern works for visualizing the age distributions of the two chambers of Congress.
%>%
congress ggplot(aes(age, color = chamber)) +
geom_freqpoly(binwidth = 5)
Interestingly, they both have a very similar shape. However, because the House has hundreds more members than the Senate, the distributions are far away from each other and it is difficult to directly compare them.
We can solve this issue by mapping the density of each age group to the y-axis. Like geom_histogram()
and geom_bar()
, geom_freqpoly()
computes y-values from the data, and so you don’t need to supply a y
aesthetic. By default, the y
aesthetic is set to stat(count)
. stat(count)
computes the number of items in each bin. In order to get geom_freqpoly()
to plot densities instead of counts, we’ll set y = stat(density)
, which computes the density of items in each bin.
%>%
congress ggplot(aes(age, y = stat(density), color = chamber)) +
geom_freqpoly(binwidth = 5)
Now, instead of mapping the number of people in each age bin to the y-axis, geom_freqpoly()
uses the density of the people in each bin. It’s now clear that, although their shapes are similar, the House is a bit younger and more evenly distributed.
In the previous section, we mentioned a use case for geom_density()
. One advantage of geom_density()
is that you can fill in the curves, which makes them easier to see.
%>%
congress ggplot(aes(age, y = stat(density), color = chamber, fill = chamber )) +
geom_density() +
stamp("Bad")
By default, the fill regions are opaque, so to be useful we need to change the transparency with alpha
.
%>%
congress ggplot(aes(age, y = stat(density), color = chamber, fill = chamber )) +
geom_density(alpha = 0.5)
This approach works best when you have sufficient data and the distributions can be distinguished from each other.
set.seed(916)
<-
sim_2 bind_rows(
tibble(mean = "0", x = rnorm(n = 1e4, mean = 0, sd = 1)),
tibble(mean = "1", x = rnorm(n = 1e5, mean = 5, sd = 2))
)
%>%
sim_2 ggplot(aes(x, color = mean, fill = mean)) +
geom_density(alpha = 0.5)
geom_density()
, like geom_freqpoly()
, doesn’t require that you specify a y
aesthetic. By default, the y
aesthetic for geom_density()
is stat(density)
.
In some situations, you’ll want to use stat(count)
instead of stat(density)
. For example, notice that in sim_2
, there are more observations in the “5” group than in the “0” group. stat(density)
doesn’t show the number of observations in each group, so you can’t see this difference in the above plot. If we use stat(count)
, however, this difference is obvious.
%>%
sim_2 ggplot(aes(x, y = stat(count), color = mean, fill = mean)) +
geom_density(alpha = 0.5)
Use stat(count)
instead of stat(density)
if you want to highlight the number of observations in each group. Otherwise, you can use the default stat(density)
.
If we want to visualize more than 2 or 3 distributions at once, we’ll need a new strategy. The variable division
contains the U.S. Census division for each Congress member’s state. There are more values in division
than in chamber
or party
, so the geom_freqpoly()
plus color combination doesn’t work well.
%>%
congress ggplot(aes(age, stat(density), color = division)) +
geom_freqpoly(binwidth = 8) +
stamp("Bad")
There are too many colors and the lines overlap too much for this visualization to be very useful.
Instead of encoding division
with color, we can encode with position along the x-axis by using boxplots.
Boxplots summarize data.
Visual summaries like boxplots are helpful because they highlight important features of the data. Like any summary, however, they don’t provide a full picture of the data. Boxplots can’t fully convey the shape of the distribution in the same way that a histogram or density plot does. Sometimes this trade-off will be worth it, and sometimes it won’t.
Because there are so many different values in division
, boxplots are a good strategy. Summarizing the data will allow us to see the differences between the divisions that were hidden in the geom_freqpoly()
plot.
%>%
congress mutate(division = fct_reorder(division, age)) %>%
ggplot(aes(division, age)) +
geom_boxplot() +
stamp("Bad")
By now, you should be used to the overlapping labels problem. We could try putting the labels on the y-axis.
%>%
congress mutate(division = fct_reorder(division, age)) %>%
ggplot(aes(age, division)) +
geom_boxplot() +
stamp("Bad")
However, flipping makes the median lines and box boundaries harder to compare, because it is more difficult to compare vertical line segments than horizontal ones.
Instead putting the labels on the y-axis, we can use theme()
to angle the labels like you learned earlier.
%>%
congress mutate(division = fct_reorder(division, age)) %>%
ggplot(aes(division, age)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
The final step is to add a reference line indicating the overall median age.
%>%
congress mutate(division = fct_reorder(division, age)) %>%
ggplot(aes(division, age)) +
geom_hline(aes(yintercept = median(age)), color = "red") +
geom_boxplot() +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
From this plot, you can see that the median ages don’t vary much across division. The size of the boxes are also similar. If you look at where in each box the median lines fall, however, you’ll notice that some are in the middle, some towards the top, and some towards the bottom. This suggests that the individual distributions are different shapes.
If you care about distribution shape, geom_violin()
is an alternative to geom_boxplot()
.
%>%
congress mutate(division = fct_reorder(division, age)) %>%
ggplot(aes(division, age)) +
geom_hline(aes(yintercept = median(age)), color = "red") +
geom_violin() +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
Violin plots are mirrored and flipped density plots. The thickness of the “violin” indicates how many values are in that area. With the violin plots, you can now tell that the distribution of ages look slightly different for different divisions. Note that, because violin plots are a form of density plot, they are only a good idea if you have sufficient data.
By default, geom_violin()
doesn’t display the median of each group. You can add quantile lines using the draw_quantiles
argument. To just draw median lines, set draw_quantiles
to 0.5.
%>%
congress mutate(division = fct_reorder(division, age)) %>%
ggplot(aes(division, age)) +
geom_hline(aes(yintercept = median(age)), color = "red") +
geom_violin(draw_quantiles = 0.5) +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
(geom_violin()
calculates the quantiles of the density estimate, which is not the same as the quantiles of the data used by quantile()
, median()
, and geom_boxplot()
. This is why the 0.5 quantile lines aren’t perfectly ordered.)
Neither boxplots nor violin plots display the amount of data behind each plot, and so you can’t tell from the plots that New England has fewer representatives than the Pacific division. There are positives and negatives to every visualization decision you’ll make, and so the best geom will depend on what you want your plot to show.