
5 Advanced data manipulation & visualization
5.1 Lesson preamble
5.1.1 Learning Objectives
Learn to make histograms, line plots, and reference lines using
ggplot(geom_histogram,geom_line,geom_abline)Understand and apply faceting in
ggplot(usingfacet_wrapandfacet_grid)Learn to switch between long and wide format data using
pivot_longerandpivot_widerFormat plots to be more readable using ggplot’s
themeoptions and labeling (labs)
5.2 Review
We’ll be continuing to work for one more lecture on the rich dataset from the Portal Project, which is a 30+ year study of a Chihuahuan desert ecosystem near Portal, Arizona, USA. The dataset we’re working with contains observations of animals found at the study site, which was subdivided into plots which underwent controlled environmental manipulation in some years.
We’ll make sure the tidyverse package is loaded and then read the data in again
Note: make sure to clear your environment before starting a new class
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# If you didn't download and save the data locally last time
# download.file("https://ndownloader.figshare.com/files/2292169", "data/portal_data.csv")
surveys <- readr::read_csv('data/portal_data.csv')Rows: 34786 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): species_id, sex, genus, species, taxa, plot_type
dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
surveys# A tibble: 34,786 × 13
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 72 8 19 1977 2 NL M 31 NA
3 224 9 13 1977 2 NL <NA> NA NA
4 266 10 16 1977 2 NL <NA> NA NA
5 349 11 12 1977 2 NL <NA> NA NA
6 363 11 12 1977 2 NL <NA> NA NA
7 435 12 10 1977 2 NL <NA> NA NA
8 506 1 8 1978 2 NL <NA> NA NA
9 588 2 18 1978 2 NL M NA 218
10 661 3 11 1978 2 NL <NA> NA NA
# ℹ 34,776 more rows
# ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
Before we go further, we’re just going to add one new column to surveys : the full Latin name (Genus species) in a new column called species_latin
surveys <- surveys %>%
mutate(species_latin = paste(genus,species)) # new column for full Latin nameUsually we don’t recommend altering the original data, but instead suggest saving edited forms under a new name, but because this is simply adding a new column (which is just a restatement of existing information), we’ll break that rule for now!
For each animal captured, the weight and hindfoot length were measured, and we previously investigated the relationship between the two variables (after removing missing data and filtering for species that weren’t very rare) using a scatterplot and found strong clustering by species
surveys_abun_species <- surveys %>%
filter(!is.na(hindfoot_length) & !is.na(weight)) %>% # remove rows with NA in either
group_by(species) %>%
mutate(n = n()) %>% # add count value (total observations of each species) to each row
filter(n > 800) %>% # filter for species observed >800 times total
select(-n) # remove the n column, don't need it anyomre
surveys_abun_species %>%
ggplot(aes(x = weight, y = hindfoot_length, colour = species_latin)) +
geom_point(alpha = 0.2)
Looking at the same variables a different way,
surveys_abun_species %>%
ggplot(aes(x = species_latin, y = hindfoot_length, colour = weight)) +
geom_jitter(size = 0.1, height = 0, width = 0.4) +
theme(axis.text.x = element_text(angle = 45,vjust=1, hjust=1))
5.3 Visualizing data with other plot types
In this section, we’ll explore a few more types of plots we can make in ggplot.
First, let’s go back to the originally-imported data and just count the total number of times each species was observed. We need to group the data and count records within each group:
surveys %>%
group_by(species_latin) %>%
tally()%>%
arrange(desc(n)) # Adding arrange just to compare with histogram# A tibble: 48 × 2
species_latin n
<chr> <int>
1 Dipodomys merriami 10596
2 Chaetodipus penicillatus 3123
3 Dipodomys ordii 3027
4 Chaetodipus baileyi 2891
5 Reithrodontomys megalotis 2609
6 Dipodomys spectabilis 2504
7 Onychomys torridus 2249
8 Perognathus flavus 1597
9 Peromyscus eremicus 1299
10 Neotoma albigula 1252
# ℹ 38 more rows
Before doing any complex analysis, it’s important to understand how much data there is. Are there species for which very few animals were ever observed in the study site? The best way to look at this sort of count data is with a histogram.
To create a histogram, we can assign this table to a variable, and then pass that variable to ggplot().
species_counts <- surveys %>%
group_by(species_latin) %>%
tally()
ggplot(species_counts, aes(x = n)) +
geom_histogram()`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Remember that a histogram plots the number of observations based on a variable, so you only need to specify the x-axis in the ggplot() call.
A histogram’s bin size can really change what you might understand about the data. The histogram geom has a bins argument that allows you to specify the number of bins and a binwidth argument that allows you to specify the size of the bins.
If we reduce the number of bins, the plot looks cleaner, but we are missing details on most of the species, which are all falling into the lowest bin.
ggplot(species_counts, aes(x = n)) +
geom_histogram(bins=10)
Alternatively we can specify how wide each bin should be using binwidth - let’s try 100, and move the rectangular bars so that instead of being centered at the median value within them (default), they span from the lowest to highest value included in the bin
ggplot(species_counts, aes(x = n)) +
geom_histogram(binwidth=100, boundary = 0)
We can understand the distribution even better by defining custom bin widths, which allows us to get more resolution for the rarely-observed species while not having too many empty bins for the few very frequently-observed species
ggplot(species_counts, aes(x = n)) +
geom_histogram(breaks = c(seq(0,1000,100),seq(2000,11000,1000)), boundary = 0)
From this latest version, we can see there’s a huge number of species (over 30!) that were observed less than 100 times over the entire 4 decade study period. These are the ones we might want to filter out going forward, since there’s probably too little data to to interpret trends. Using the value of 800 observations as a cut off for filtering (what we did previously) seems reasonable!
Now, since this was a longitudinal study, let’s calculate the number of counts per year for each species. First, we need to group the data and count records within each group:
yearly_counts <- surveys %>%
group_by(year, species_latin) %>%
tally() %>%
arrange(desc(n)) # Adding arrange just to compare with histogram
yearly_counts# A tibble: 509 × 3
# Groups: year [26]
year species_latin n
<dbl> <chr> <int>
1 2002 Chaetodipus baileyi 892
2 1985 Dipodomys merriami 667
3 1982 Dipodomys merriami 609
4 1997 Dipodomys merriami 576
5 1981 Dipodomys merriami 559
6 2000 Chaetodipus baileyi 555
7 2001 Chaetodipus baileyi 538
8 1983 Dipodomys merriami 528
9 1998 Dipodomys merriami 503
10 1980 Dipodomys merriami 493
# ℹ 499 more rows
Now, let’s again do a histogram
ggplot(yearly_counts, aes(x = n)) +
geom_histogram(boundary=0)
Note that creating an intermediate variable for the processed data ( yearly_counts ) is be preferable for time consuming calculations, because you would not want to do that operation every time you change the plot aesthetics. However, if it is not a time consuming calculation, or you would like the flexibility of changing the data summary and the plotting options in the same code chunk, you can pipe the output of your split-apply-combine operation to the plotting command:
surveys %>%
group_by(year, species_latin) %>%
tally() %>%
ggplot(aes(x = n)) +
geom_histogram(boundary = 1)
We can perform a quick check that the plot corresponds to the table by colouring the histogram by species. However, we first want to get rid of all those rarely observed species that we know will crowd our plot
yearly_counts_abun <- yearly_counts %>%
group_by(species_latin) %>%
mutate(n_tot_species=sum(n)) %>% # new column containing total observations per species
filter(n_tot_species>800) %>%
select(-n_tot_species)
ggplot(yearly_counts_abun, aes(x = n, fill = species_latin)) +
geom_histogram(boundary = 0)
# We are using "fill" here instead of "colour"Note: Here we are using fill to assign colours to species rather than colour. In general colour refers to the outline of points/bars or whatever it is you are plotting and fill refers to the colour that goes inside the point or bar. If you are confused, try switching out fill for colour to see what looks best!
Now let’s explore how the number of each species varies over time. Longitudinal data can be visualized as a line plot with years on the x axis and counts on the y axis:
yearly_counts_abun %>%
ggplot(aes(x = year, y = n)) +
geom_line()
Unfortunately, this does not work because we plotted data for all the species together as one line. We need to tell ggplot to draw a line for each species by modifying the aesthetic function to include group = species:
yearly_counts_abun %>%
ggplot(aes(x = year, y = n, group = species_latin)) +
geom_line()
We will be able to distinguish species in the plot if we add colours (using colour also automatically groups the data):
annual_plot <- yearly_counts_abun %>%
ggplot(aes(x = year, y = n, color = species_latin)) + # you can say color or colour
geom_line()
annual_plot
There seem to be large fluctuations over time in the levels of many of these species! We can calculate an average
avg_species <- yearly_counts_abun %>%
group_by(species_latin) %>%
summarize(avg_yearly_count = mean(n))
avg_species# A tibble: 12 × 2
species_latin avg_yearly_count
<chr> <dbl>
1 Chaetodipus baileyi 361.
2 Chaetodipus penicillatus 120.
3 Dipodomys merriami 408.
4 Dipodomys ordii 116.
5 Dipodomys spectabilis 119.
6 Neotoma albigula 48.2
7 Onychomys leucogaster 45.7
8 Onychomys torridus 86.5
9 Perognathus flavus 66.5
10 Peromyscus eremicus 50.0
11 Peromyscus maniculatus 45.0
12 Reithrodontomys megalotis 100.
annual_plot +
geom_abline(data=avg_species, aes(slope = 0, intercept = avg_yearly_count, colour = species_latin), linetype = "dashed")
5.4 Creating multi-panel plots (faceting)
ggplot has a special technique called faceting that allows the user to split one plot into multiple subplots based on a variable included in the dataset. This allows us to examine the trends associated with each grouping variable more closely. We will use it to make a time series plot for each species:
yearly_counts_abun %>%
ggplot(aes(x = year, y = n)) +
geom_line() +
facet_wrap(~species_latin)
The facet_wrap function creates a new subplot for each value of the variable passed to it (~name). You can control the number of rows and columns as well as lots of other options.
Now we would like to split the line in each plot by the sex of each individual measured. To do that we need to make counts in the data frame after grouping by year, species_latin, and sex, so we are going to redo our earlier filtering, preserving the breakdown by sex:
yearly_counts_by_sex_abun <- surveys %>%
group_by(year, species_latin, sex) %>%
tally() %>%
group_by(species_latin) %>%
mutate(n_tot_species=sum(n)) %>% # new column containing total observations per species
filter(n_tot_species>800) %>%
select(-n_tot_species)
yearly_counts_by_sex_abun# A tibble: 687 × 4
# Groups: species_latin [12]
year species_latin sex n
<dbl> <chr> <chr> <int>
1 1977 Chaetodipus penicillatus F 4
2 1977 Chaetodipus penicillatus M 3
3 1977 Dipodomys merriami F 104
4 1977 Dipodomys merriami M 152
5 1977 Dipodomys merriami <NA> 8
6 1977 Dipodomys ordii F 10
7 1977 Dipodomys ordii M 2
8 1977 Dipodomys spectabilis F 58
9 1977 Dipodomys spectabilis M 34
10 1977 Dipodomys spectabilis <NA> 6
# ℹ 677 more rows
We can reflect this grouping by sex in the faceted plot by splitting further with colour (within a single plot).
yearly_counts_by_sex_abun %>%
ggplot(aes(x = year, y = n, colour = sex)) +
geom_line() +
facet_wrap(~species_latin)
There are several observations where sex was not recorded. Let’s filter out those values.
yearly_counts_by_sex_abun <- yearly_counts_by_sex_abun %>%
filter(!is.na(sex))
yearly_counts_by_sex_abun %>%
ggplot(aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~species_latin)
It is possible to specify exactly which colors1 to use and to change the thickness of the lines to make the them easier to distinguish.
yearly_counts_by_sex_abun %>%
ggplot(aes(x = year, y = n, colour = sex)) +
geom_line(size = 1) +
scale_colour_manual(values = c("black", "orange")) +
facet_wrap(~species_latin) Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Not sure what colours would look good on your plot? The R Community got you covered! Check out these awesome color palettes where nice-looking color combos come predefined. We especially recommend the viridis color palettes. These palettes are not only pretty, they are specifically designed to be easier to read by those with colourblindness.
Lastly, let’s change the x labels so that they don’t overlap, split the subplot titles into multiple lines if they’re too long, and remove the grey background to increase contrast with the lines. To customize the non-data components of the plot, we will pass some theme statements2 to ggplot.
yearly_counts_by_sex_abun %>%
ggplot(aes(x = year, y = n, color = sex)) +
geom_line(size = 1) +
scale_colour_viridis_d() +
facet_wrap(~species_latin, labeller = label_wrap_gen(width = 14)) +
theme_classic() +
theme(text = element_text(size = 10),
axis.text.x = element_text(angle = 30, hjust = 1))
There are other popular theme options, such as theme_bw().
Our plot looks pretty polished now! It would be difficult to share with other, however, given the lack of information provided on the Y axis. Let’s add some meaningful axis labels.
yearly_counts_by_sex_abun %>%
ggplot(aes(x = year, y = n, color = sex)) +
geom_line(size = 1) +
scale_colour_viridis_d() +
facet_wrap(~species_latin, labeller = label_wrap_gen(width = 14)) +
theme_classic() +
theme(text = element_text(size = 10),
axis.text.x = element_text(angle = 30, hjust = 1)) +
labs(title = "Species abundance over time",
x = "Year",
y = "Number observed",
colour = "Sex")
There is a related sub-plotting function called facet_grid, which can instead take two variables, where one will vary along the rows of the grid and another along the columns. Here, we’ll make a set of plots to examine how different genuses of animals (genus) respond to the different experimental manipulations applied to the plots (plot_type).
surveys %>%
filter(!is.na(sex)) %>%
group_by(year, genus, plot_type, sex) %>%
tally() %>%
ggplot(aes(x = year, y = n, color = sex)) +
geom_line(size = 1) +
facet_grid(genus ~ plot_type, scales = "free",labeller = label_wrap_gen(width = 10)) +
theme(text = element_text(size = 8),
axis.text.x = element_text(angle = 30, hjust = 1)) +
labs(title = "Rodent abundance over time",
x = "Year",
y = "Number observed",
colour = "Sex")
It might be hard to see the plot (it’ll be cramped!) if you’re just viewing the Markdown notebook in the RStudio editor. You can click on the pop-out window icon in the top right corner of the plot to open a new window and expand it (the text will adjust accordingly).
5.4.1 Challenge
Use the filtered data frame (surveys_abun_species) for part 2.
1. Remember the histogram coloured according to each species? Starting from that code, how could we separate each species into its own subplot?
2.a. Create a plot that shows the average weight over years. Which year was the average weight of all animals the highest?
2.b. Iterate on the plot so it shows differences among species of their average weight over time. Is the yearly trend the same for all species?
5.5 Reshaping with pivot_wider and pivot_longer
5.5.1 Defining wide vs long data
The survey data presented here is almost in what we call a long format – every observation of every individual is its own row. This is an ideal format for data with a rich set of information per observation. It makes it difficult, however, to look at the relationships between measurements across plots/trials. For example, what is the relationship between mean weights of different genera across all plots?
To answer that question, we want each plot to have its own row, with each measurements in its own column. This is called a wide data format. For the surveys data as we have it right now, this is going to be one heck of a wide data frame! However, if we were to summarize data within plots and species, we can reduce the dataset and begin to look for some relationships we’d want to examine. We need to create a new table where each row is the values for a particular variable associated with each plot. In practical terms, this means the values in genus would become the names of column variables and the cells would contain the values of the mean weight observed on each plot by genus.
We can use the functions called pivot_wider() and pivot_longer() (these are newer replacements for spread() and gather(), which were the older functions). These can feel tricky to think through, but do not feel alone in this! Many others have squinted at their data, unsure exactly how to reshape it, so there are many guides and cheatsheets available to help!
5.5.2 Summary of long vs wide formats
Long format:
- every column is a variable
- first column(s) repeat
- every row is an observation
Wide format:
- each row is a measured thing
- each column is an independent observation
- first column does not repeat
5.5.3 Long to Wide with pivot_wider
Let’s start by using dplyr to create a data frame with the mean body weight of each genus by plot.
surveys_gw <- surveys %>%
filter(!is.na(weight)) %>%
group_by(genus, plot_id) %>%
summarize(mean_weight = mean(weight))`summarise()` has grouped output by 'genus'. You can override using the
`.groups` argument.
surveys_gw# A tibble: 196 × 3
# Groups: genus [10]
genus plot_id mean_weight
<chr> <dbl> <dbl>
1 Baiomys 1 7
2 Baiomys 2 6
3 Baiomys 3 8.61
4 Baiomys 5 7.75
5 Baiomys 18 9.5
6 Baiomys 19 9.53
7 Baiomys 20 6
8 Baiomys 21 6.67
9 Chaetodipus 1 22.2
10 Chaetodipus 2 25.1
# ℹ 186 more rows
Now, to make this long data wide, we use pivot_wider() from tidyr to spread out the different taxa into columns. pivot_wider() takes 3 arguments: the data , the names_from column variable that will eventually become the column names, and the values_from column variable that will fill in the values. We’ll use a pipe so we don’t need to explicitly supply the data argument.
surveys_gw_wide <- surveys_gw %>%
pivot_wider(names_from = genus, values_from = mean_weight)
surveys_gw_wide# A tibble: 24 × 11
plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 7 22.2 60.2 156. 27.7 9.62
2 2 6 25.1 55.7 169. 26.9 6.95
3 3 8.61 24.6 52.0 158. 26.0 7.51
4 5 7.75 18.0 51.1 190. 27.0 8.66
5 18 9.5 26.8 61.4 149. 26.6 8.62
6 19 9.53 26.4 43.3 120 23.8 8.09
7 20 6 25.1 65.9 155. 25.2 8.14
8 21 6.67 28.2 42.7 138. 24.6 9.19
9 4 NA 23.0 57.5 164. 28.1 7.82
10 6 NA 24.9 58.6 180. 25.9 7.81
# ℹ 14 more rows
# ℹ 4 more variables: Peromyscus <dbl>, Reithrodontomys <dbl>, Sigmodon <dbl>,
# Spermophilus <dbl>
Now we can go back to our original question: what is the relationship between mean weights of different genera across all plots? We can easily see the weights for each genus in each plot! Notice that some genera have NA values. That’s because some genera were not recorded in that plot.
5.5.4 Wide to long with pivot_longer
What if we had the opposite problem, and wanted to go from a wide to long format? For that, we can use pivot_longer() to gather a set of columns into one key-value pair. To go backwards from surveys_gw_wide, we should exclude plot_id.
pivot_longer() takes 4 arguments: the data, the names_to column variable that comes from the column names, the values_to column with the values, and cols which specifies which columns we want to keep or drop. Again, we will pipe from the dataset so we don’t have to specify the data argument:
surveys_gw_long <- surveys_gw_wide %>%
pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id)
surveys_gw_long# A tibble: 240 × 3
plot_id genus mean_weight
<dbl> <chr> <dbl>
1 1 Baiomys 7
2 1 Chaetodipus 22.2
3 1 Dipodomys 60.2
4 1 Neotoma 156.
5 1 Onychomys 27.7
6 1 Perognathus 9.62
7 1 Peromyscus 22.2
8 1 Reithrodontomys 11.4
9 1 Sigmodon NA
10 1 Spermophilus NA
# ℹ 230 more rows
If the columns are directly adjacent as they are here, we don’t even need to list the all out: we can just use the : operator, as before.
surveys_gw_wide %>%
pivot_longer(names_to = "genus", values_to = "mean_weight", cols = Baiomys:Spermophilus)# A tibble: 240 × 3
plot_id genus mean_weight
<dbl> <chr> <dbl>
1 1 Baiomys 7
2 1 Chaetodipus 22.2
3 1 Dipodomys 60.2
4 1 Neotoma 156.
5 1 Onychomys 27.7
6 1 Perognathus 9.62
7 1 Peromyscus 22.2
8 1 Reithrodontomys 11.4
9 1 Sigmodon NA
10 1 Spermophilus NA
# ℹ 230 more rows
Note that now the NA genera are included in the long format.
5.5.5 Challenge
Starting with the surveys_gw_wide dataset, how would you display a new dataset that gathers the mean weight of all the genera (excluding NAs) except for the genus Perognathus?