Histograms, boxplots, and pointalism using {ggplot2}.
By following along with this document, you will know how to:
ggplot()
aes()
geom_histogram()
geom_boxplot()
labs()
ggsave()
Basically, this help document will provide you with the tools necessary to complete the labs for GLHLTH 705.
If you’re interested in going one level deeper, we highly recommend you check out the following resources, which will give a better introduction to {ggplot2} than we ever could (also why we’re plugging these at the top).
As R rule of thumb, it’s good to have multiple mediums of exposure to the same idea. We recommend you pick the one that suits your learning style and come back for more later:
John Little is nothing short of the world’s best librarian.
R for Data Science, Chapter 3, by Hadley Wickham
After a preface and introduction, this is the first actual chapter in R4DS. The rationale is that {ggplot2} is actually pretty fun and satisfying to use. It’s pretty well guaranteed to have you hooked if you give it a chance.
“A Layered Grammar of Graphics” by Hadley Wickham
Published in the Journal of Computational and Graphical Statistics, 2010
ggplot2: Elegant Graphics for Data Analysis, by Hadely Wickham
**cough** Also by Hadley Wickham. (look it up using the Duke Library search engine and log in with your Duke credentials
Reference page of ggplot2 commands
For those of you who made it through that onslaught of links without clicking on a single one, welcome to the Core Competencies section. We hope that this section is somehow dry enough that you go and find your answers in one of the resources above. But for those of you who are still feeling stubborn:
ggplot()
To initialize a plotting space, we first need to tell R that we want
to use ggplot. If we just call ggplot()
and run it without
any data, we get a blank field. This is our sandbox:
ggplot()
For this example, we’ll use the dataset available in base R called
iris
, which provides measurements and species data on a
bunch of – you guessed it – irises.
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
We can add the dataframe to our plot by including it in the argument
data =
.
Superficially, this doesn’t change our output:
ggplot(data = iris)
But the data frame is now a part of the plot. One way to verify this
is by assigning the two previous plots a name and inspecting their size
with object.size()
. The plot with the data should be
bigger:
without_data <- ggplot()
object.size(without_data)
#> 3720 bytes
with_data <- ggplot(data = iris)
object.size(with_data)
#> 10704 bytes
aes()
:Next, we need to tell ggplot which variables we’re working with, and
where to put them (their “aesthetic mapping”). We do this using the
argument aes(x = variable1, y = variable2)
If we’re creating histograms and singular boxplots, we only require a single variable on the x-axis. We can initialize it as follows:
See how ggplot assigned Sepal.Length to the x axis?
geom_histogram()
Okay, let’s cut to the chase. We want a plot.
{ggplot2} has a large number of plotting types and styles. Given the
types of variables we’ve mapped to ggplot’s aesthetics, all we need to
do is choose a type of plot appropriate for that type of variable, and
add it as a new layer with +
ggplot(data = iris, mapping = aes(x = Sepal.Length)) +
geom_histogram()
Many plots also allow us to add color with the argument
aes(fill = "colorname")
(colors are always written as
strings, in quotes!).
We may also change the size of our bins with argument
binwidth = x
:
ggplot(data = iris, mapping = aes(x = Sepal.Length)) +
geom_histogram(fill = "#12BBAC", binwidth = .25)
geom_boxplot()
A single boxplot functions in the same exact manner. Instead of
geom_histogram()
, we add a boxplot layer with
geom_boxplot()
. This time, I’ve used a default
color name, goldenrod2
, instead of a hexadecimal color
code:
ggplot(data = iris, mapping = aes(x = Sepal.Length)) +
geom_boxplot(fill = 'goldenrod2')
We can create multiple boxplots within a single plot by adding a
categorical variable as a second aesthetic. The iris
dataset contains a categorical variable, Species
, which
would be appropriate for this task:
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Species)) +
geom_boxplot(fill = '#AC12BB')
We can also display the boxplots vertically by assigning
Sepal.Length
to y =
and Species
to x =
:
ggplot(data = iris, mapping = aes(y = Sepal.Length, x = Species)) +
geom_boxplot(fill = '#AC12BB')
labs()
Finally, we need to make our plots fit for public use… it needs axis
labels and a title. We can specify these by adding another layer to our
plot, labs()
. Make sure you write your labels as
strings:
ggplot(data = iris, mapping = aes(y = Sepal.Length, x = Species)) +
geom_boxplot(fill = "#12BBAC") +
labs(x = "Species",
y = "Sepal Length",
title = "Boxplots of Sepal Length of Irises by Species")
facet_wrap()
or
filter()
You might be wondering what this sort of stratification might look life if we tried the same thing with a histogram. Can a histogram accept a y aesthetic? When we try and assign a second aesthetic to a histogram, we get the following result:
ggplot(data = iris, mapping = aes(x = Species, y = Sepal.Length)) +
geom_histogram(fill = "goldenrod2")
#> Error: stat_bin() can only have an x or y aesthetic.
facet_wrap()
An easy way to generate stratified histograms is with the additional
layer, facet_wrap()
, which takes a formula in the following
syntax:
. ~ stratifyingVariable
The period here represents our ggplot object. We put the stratifying
variable on the right side of the formula as a way to designate it as
the “independent variable” of sorts. The output, .
,
depends on whatever categorical we assign as our faceting
variable. In this case, the histograms dependson the variable
Species
:
ggplot(data = iris, mapping = aes(x = Sepal.Length)) +
geom_histogram(fill = "goldenrod2") +
facet_wrap(. ~ Species) +
labs(x = "Sepal Length",
y = "Count",
title = "Histograms of Sepal Length of Irises by Species")
We might decide that we want the plots stacked vertically instead of
horizontally to help us better compare their distributions by Sepal
Length. We can do that too, with the argument nrow =
:
ggplot(data = iris, mapping = aes(x = Sepal.Length)) +
geom_histogram(fill = "#AC12BB") +
facet_wrap(. ~ Species, nrow = 3) +
labs(x = "Sepal Length",
y = "Count",
title = "Histograms of Sepal Length of Irises by Species")
filter()
What if we only want the Sepal Lengths for the species Iris virginica?
One way would be to use filter()
, which we connect to
our ggplot using a pipe: