5 Advanced data manipulation & visualization

Published

January 19, 2026

5.1 Lesson preamble

5.1.1 Learning Objectives

Learn to make histograms, line plots, and reference lines using ggplot (geom_histogram, geom_line, geom_abline)

Understand and apply faceting in ggplot (using facet_wrap and facet_grid)

Learn to switch between long and wide format data using pivot_longer and pivot_wider

Format plots to be more readable using ggplot’s theme options and labeling (labs)

5.2 Review

We’ll be continuing to work for one more lecture on the rich dataset from the Portal Project, which is a 30+ year study of a Chihuahuan desert ecosystem near Portal, Arizona, USA. The dataset we’re working with contains observations of animals found at the study site, which was subdivided into plots which underwent controlled environmental manipulation in some years.

The Portal Project is a long-term ecological study in the a desert ecosystem in Arizona, which includes frequent sampling over 30+ years and controlled manipulations of the ecosystem

We’ll make sure the tidyverse package is loaded and then read the data in again

Note: make sure to clear your environment before starting a new class

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# If you didn't download and save the data locally last time
# download.file("https://ndownloader.figshare.com/files/2292169", "data/portal_data.csv")

surveys <- readr::read_csv('data/portal_data.csv')

Rows: 34786 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): species_id, sex, genus, species, taxa, plot_type
dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

surveys

# A tibble: 34,786 × 13
   record_id month   day  year plot_id species_id sex   hindfoot_length weight
       <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
 1         1     7    16  1977       2 NL         M                  32     NA
 2        72     8    19  1977       2 NL         M                  31     NA
 3       224     9    13  1977       2 NL         <NA>               NA     NA
 4       266    10    16  1977       2 NL         <NA>               NA     NA
 5       349    11    12  1977       2 NL         <NA>               NA     NA
 6       363    11    12  1977       2 NL         <NA>               NA     NA
 7       435    12    10  1977       2 NL         <NA>               NA     NA
 8       506     1     8  1978       2 NL         <NA>               NA     NA
 9       588     2    18  1978       2 NL         M                  NA    218
10       661     3    11  1978       2 NL         <NA>               NA     NA
# ℹ 34,776 more rows
# ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

Before we go further, we’re just going to add one new column to surveys : the full Latin name (Genus species) in a new column called species_latin

surveys <- surveys %>% 
  mutate(species_latin = paste(genus,species)) # new column for full Latin name

Usually we don’t recommend altering the original data, but instead suggest saving edited forms under a new name, but because this is simply adding a new column (which is just a restatement of existing information), we’ll break that rule for now!

For each animal captured, the weight and hindfoot length were measured, and we previously investigated the relationship between the two variables (after removing missing data and filtering for species that weren’t very rare) using a scatterplot and found strong clustering by species

surveys_abun_species <- surveys %>% 
  filter(!is.na(hindfoot_length) & !is.na(weight)) %>% # remove rows with NA in either
  group_by(species) %>%
  mutate(n = n()) %>% # add count value (total observations of each species) to each row
  filter(n > 800) %>% # filter for species observed >800 times total
  select(-n) # remove the n column, don't need it anyomre

surveys_abun_species %>%
  ggplot(aes(x = weight, y = hindfoot_length, colour = species_latin)) +
  geom_point(alpha = 0.2)

Looking at the same variables a different way,

surveys_abun_species %>%
  ggplot(aes(x = species_latin, y = hindfoot_length, colour = weight)) +
  geom_jitter(size = 0.1, height = 0, width = 0.4) + 
  theme(axis.text.x = element_text(angle = 45,vjust=1, hjust=1))

5.3 Visualizing data with other plot types

In this section, we’ll explore a few more types of plots we can make in ggplot.

First, let’s go back to the originally-imported data and just count the total number of times each species was observed. We need to group the data and count records within each group:

surveys %>% 
  group_by(species_latin) %>%
  tally()%>%
  arrange(desc(n)) # Adding arrange just to compare with histogram

# A tibble: 48 × 2
   species_latin                 n
   <chr>                     <int>
 1 Dipodomys merriami        10596
 2 Chaetodipus penicillatus   3123
 3 Dipodomys ordii            3027
 4 Chaetodipus baileyi        2891
 5 Reithrodontomys megalotis  2609
 6 Dipodomys spectabilis      2504
 7 Onychomys torridus         2249
 8 Perognathus flavus         1597
 9 Peromyscus eremicus        1299
10 Neotoma albigula           1252
# ℹ 38 more rows

Before doing any complex analysis, it’s important to understand how much data there is. Are there species for which very few animals were ever observed in the study site? The best way to look at this sort of count data is with a histogram.

To create a histogram, we can assign this table to a variable, and then pass that variable to ggplot().

species_counts <- surveys %>% 
  group_by(species_latin) %>%
  tally()

ggplot(species_counts, aes(x = n)) +
    geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Remember that a histogram plots the number of observations based on a variable, so you only need to specify the x-axis in the ggplot() call.

A histogram’s bin size can really change what you might understand about the data. The histogram geom has a bins argument that allows you to specify the number of bins and a binwidth argument that allows you to specify the size of the bins.

If we reduce the number of bins, the plot looks cleaner, but we are missing details on most of the species, which are all falling into the lowest bin.

ggplot(species_counts, aes(x = n)) +
    geom_histogram(bins=10)

Alternatively we can specify how wide each bin should be using binwidth - let’s try 100, and move the rectangular bars so that instead of being centered at the median value within them (default), they span from the lowest to highest value included in the bin

ggplot(species_counts, aes(x = n)) +
    geom_histogram(binwidth=100, boundary = 0)

We can understand the distribution even better by defining custom bin widths, which allows us to get more resolution for the rarely-observed species while not having too many empty bins for the few very frequently-observed species

ggplot(species_counts, aes(x = n)) +
    geom_histogram(breaks = c(seq(0,1000,100),seq(2000,11000,1000)), boundary = 0)

From this latest version, we can see there’s a huge number of species (over 30!) that were observed less than 100 times over the entire 4 decade study period. These are the ones we might want to filter out going forward, since there’s probably too little data to to interpret trends. Using the value of 800 observations as a cut off for filtering (what we did previously) seems reasonable!

Now, since this was a longitudinal study, let’s calculate the number of counts per year for each species. First, we need to group the data and count records within each group:

yearly_counts <- surveys %>%
    group_by(year, species_latin) %>%
    tally() %>%
    arrange(desc(n)) # Adding arrange just to compare with histogram

yearly_counts

# A tibble: 509 × 3
# Groups:   year [26]
    year species_latin           n
   <dbl> <chr>               <int>
 1  2002 Chaetodipus baileyi   892
 2  1985 Dipodomys merriami    667
 3  1982 Dipodomys merriami    609
 4  1997 Dipodomys merriami    576
 5  1981 Dipodomys merriami    559
 6  2000 Chaetodipus baileyi   555
 7  2001 Chaetodipus baileyi   538
 8  1983 Dipodomys merriami    528
 9  1998 Dipodomys merriami    503
10  1980 Dipodomys merriami    493
# ℹ 499 more rows

Now, let’s again do a histogram

ggplot(yearly_counts, aes(x = n)) +
    geom_histogram(boundary=0)

Note that creating an intermediate variable for the processed data ( yearly_counts ) is be preferable for time consuming calculations, because you would not want to do that operation every time you change the plot aesthetics. However, if it is not a time consuming calculation, or you would like the flexibility of changing the data summary and the plotting options in the same code chunk, you can pipe the output of your split-apply-combine operation to the plotting command:

surveys %>%
    group_by(year, species_latin) %>%
    tally() %>%
    ggplot(aes(x = n)) +
        geom_histogram(boundary = 1)

We can perform a quick check that the plot corresponds to the table by colouring the histogram by species. However, we first want to get rid of all those rarely observed species that we know will crowd our plot

yearly_counts_abun <- yearly_counts %>%
  group_by(species_latin) %>%
  mutate(n_tot_species=sum(n)) %>% # new column containing total observations per species
  filter(n_tot_species>800) %>%
  select(-n_tot_species)

ggplot(yearly_counts_abun, aes(x = n, fill = species_latin)) + 
        geom_histogram(boundary = 0)

# We are using "fill" here instead of "colour"

Note: Here we are using fill to assign colours to species rather than colour. In general colour refers to the outline of points/bars or whatever it is you are plotting and fill refers to the colour that goes inside the point or bar. If you are confused, try switching out fill for colour to see what looks best!

Now let’s explore how the number of each species varies over time. Longitudinal data can be visualized as a line plot with years on the x axis and counts on the y axis:

yearly_counts_abun  %>%
    ggplot(aes(x = year, y = n)) +
        geom_line()

Unfortunately, this does not work because we plotted data for all the species together as one line. We need to tell ggplot to draw a line for each species by modifying the aesthetic function to include group = species:

yearly_counts_abun  %>%
    ggplot(aes(x = year, y = n, group = species_latin)) +
        geom_line()

We will be able to distinguish species in the plot if we add colours (using colour also automatically groups the data):

annual_plot <- yearly_counts_abun  %>%
    ggplot(aes(x = year, y = n, color = species_latin)) + # you can say color or colour
        geom_line()

annual_plot

There seem to be large fluctuations over time in the levels of many of these species! We can calculate an average

avg_species <- yearly_counts_abun %>%
  group_by(species_latin) %>%
  summarize(avg_yearly_count = mean(n))

avg_species

# A tibble: 12 × 2
   species_latin             avg_yearly_count
   <chr>                                <dbl>
 1 Chaetodipus baileyi                  361. 
 2 Chaetodipus penicillatus             120. 
 3 Dipodomys merriami                   408. 
 4 Dipodomys ordii                      116. 
 5 Dipodomys spectabilis                119. 
 6 Neotoma albigula                      48.2
 7 Onychomys leucogaster                 45.7
 8 Onychomys torridus                    86.5
 9 Perognathus flavus                    66.5
10 Peromyscus eremicus                   50.0
11 Peromyscus maniculatus                45.0
12 Reithrodontomys megalotis            100.

annual_plot + 
  geom_abline(data=avg_species, aes(slope = 0, intercept = avg_yearly_count, colour = species_latin), linetype = "dashed")

5.4 Creating multi-panel plots (faceting)

ggplot has a special technique called faceting that allows the user to split one plot into multiple subplots based on a variable included in the dataset. This allows us to examine the trends associated with each grouping variable more closely. We will use it to make a time series plot for each species:

yearly_counts_abun %>%
    ggplot(aes(x = year, y = n)) + 
        geom_line() +
        facet_wrap(~species_latin)

The facet_wrap function creates a new subplot for each value of the variable passed to it (~name). You can control the number of rows and columns as well as lots of other options.

Now we would like to split the line in each plot by the sex of each individual measured. To do that we need to make counts in the data frame after grouping by year, species_latin, and sex, so we are going to redo our earlier filtering, preserving the breakdown by sex:

yearly_counts_by_sex_abun <- surveys %>%
    group_by(year, species_latin, sex) %>%
    tally() %>% 
    group_by(species_latin) %>%
    mutate(n_tot_species=sum(n)) %>% # new column containing total observations per species
    filter(n_tot_species>800) %>%
    select(-n_tot_species)

yearly_counts_by_sex_abun

# A tibble: 687 × 4
# Groups:   species_latin [12]
    year species_latin            sex       n
   <dbl> <chr>                    <chr> <int>
 1  1977 Chaetodipus penicillatus F         4
 2  1977 Chaetodipus penicillatus M         3
 3  1977 Dipodomys merriami       F       104
 4  1977 Dipodomys merriami       M       152
 5  1977 Dipodomys merriami       <NA>      8
 6  1977 Dipodomys ordii          F        10
 7  1977 Dipodomys ordii          M         2
 8  1977 Dipodomys spectabilis    F        58
 9  1977 Dipodomys spectabilis    M        34
10  1977 Dipodomys spectabilis    <NA>      6
# ℹ 677 more rows

We can reflect this grouping by sex in the faceted plot by splitting further with colour (within a single plot).

yearly_counts_by_sex_abun %>%
    ggplot(aes(x = year, y = n, colour = sex)) +
        geom_line() +
        facet_wrap(~species_latin)

There are several observations where sex was not recorded. Let’s filter out those values.

yearly_counts_by_sex_abun <- yearly_counts_by_sex_abun %>% 
  filter(!is.na(sex))

yearly_counts_by_sex_abun %>%
    ggplot(aes(x = year, y = n, color = sex)) +
        geom_line() +
        facet_wrap(~species_latin)

It is possible to specify exactly which colors¹ to use and to change the thickness of the lines to make the them easier to distinguish.

yearly_counts_by_sex_abun %>%
    ggplot(aes(x = year, y = n, colour = sex)) +
        geom_line(size = 1) +
        scale_colour_manual(values = c("black", "orange")) +
        facet_wrap(~species_latin)

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Not sure what colours would look good on your plot? The R Community got you covered! Check out these awesome color palettes where nice-looking color combos come predefined. We especially recommend the viridis color palettes. These palettes are not only pretty, they are specifically designed to be easier to read by those with colourblindness.

Lastly, let’s change the x labels so that they don’t overlap, split the subplot titles into multiple lines if they’re too long, and remove the grey background to increase contrast with the lines. To customize the non-data components of the plot, we will pass some theme statements² to ggplot.

yearly_counts_by_sex_abun %>%
  ggplot(aes(x = year, y = n, color = sex)) +
  geom_line(size = 1) +
  scale_colour_viridis_d() +
  facet_wrap(~species_latin, labeller = label_wrap_gen(width = 14)) +
  theme_classic() +
  theme(text = element_text(size = 10),
        axis.text.x = element_text(angle = 30, hjust = 1))

There are other popular theme options, such as theme_bw().

Our plot looks pretty polished now! It would be difficult to share with other, however, given the lack of information provided on the Y axis. Let’s add some meaningful axis labels.

yearly_counts_by_sex_abun %>%
  ggplot(aes(x = year, y = n, color = sex)) +
  geom_line(size = 1) +
  scale_colour_viridis_d() +
  facet_wrap(~species_latin, labeller = label_wrap_gen(width = 14)) +
  theme_classic() +
  theme(text = element_text(size = 10),
        axis.text.x = element_text(angle = 30, hjust = 1)) +
  labs(title = "Species abundance over time",
       x = "Year",
       y = "Number observed",
       colour = "Sex")

There is a related sub-plotting function called facet_grid, which can instead take two variables, where one will vary along the rows of the grid and another along the columns. Here, we’ll make a set of plots to examine how different genuses of animals (genus) respond to the different experimental manipulations applied to the plots (plot_type).

surveys %>% 
  filter(!is.na(sex)) %>%
  group_by(year, genus, plot_type, sex) %>%
  tally() %>%
  ggplot(aes(x = year, y = n, color = sex)) +
  geom_line(size = 1) +
  facet_grid(genus ~ plot_type, scales = "free",labeller = label_wrap_gen(width = 10))  +
  theme(text = element_text(size = 8),
        axis.text.x = element_text(angle = 30, hjust = 1)) +
  labs(title = "Rodent abundance over time",
       x = "Year",
       y = "Number observed",
       colour = "Sex")

It might be hard to see the plot (it’ll be cramped!) if you’re just viewing the Markdown notebook in the RStudio editor. You can click on the pop-out window icon in the top right corner of the plot to open a new window and expand it (the text will adjust accordingly).

5.4.1 Challenge

Use the filtered data frame (surveys_abun_species) for part 2.

1. Remember the histogram coloured according to each species? Starting from that code, how could we separate each species into its own subplot?

2.a. Create a plot that shows the average weight over years. Which year was the average weight of all animals the highest?

2.b. Iterate on the plot so it shows differences among species of their average weight over time. Is the yearly trend the same for all species?

5.5 Reshaping with pivot_wider and pivot_longer

5.5.1 Defining wide vs long data

The survey data presented here is almost in what we call a long format – every observation of every individual is its own row. This is an ideal format for data with a rich set of information per observation. It makes it difficult, however, to look at the relationships between measurements across plots/trials. For example, what is the relationship between mean weights of different genera across all plots?

To answer that question, we want each plot to have its own row, with each measurements in its own column. This is called a wide data format. For the surveys data as we have it right now, this is going to be one heck of a wide data frame! However, if we were to summarize data within plots and species, we can reduce the dataset and begin to look for some relationships we’d want to examine. We need to create a new table where each row is the values for a particular variable associated with each plot. In practical terms, this means the values in genus would become the names of column variables and the cells would contain the values of the mean weight observed on each plot by genus.

We can use the functions called pivot_wider() and pivot_longer() (these are newer replacements for spread() and gather(), which were the older functions). These can feel tricky to think through, but do not feel alone in this! Many others have squinted at their data, unsure exactly how to reshape it, so there are many guides and cheatsheets available to help!

5.5.2 Summary of long vs wide formats

Long format:

every column is a variable
- first column(s) repeat
every row is an observation

Wide format:

each row is a measured thing
each column is an independent observation
- first column does not repeat

5.5.3 Long to Wide with `pivot_wider`

Let’s start by using dplyr to create a data frame with the mean body weight of each genus by plot.

surveys_gw <- surveys %>%
    filter(!is.na(weight)) %>%
    group_by(genus, plot_id) %>%
    summarize(mean_weight = mean(weight))

`summarise()` has grouped output by 'genus'. You can override using the
`.groups` argument.

surveys_gw

# A tibble: 196 × 3
# Groups:   genus [10]
   genus       plot_id mean_weight
   <chr>         <dbl>       <dbl>
 1 Baiomys           1        7   
 2 Baiomys           2        6   
 3 Baiomys           3        8.61
 4 Baiomys           5        7.75
 5 Baiomys          18        9.5 
 6 Baiomys          19        9.53
 7 Baiomys          20        6   
 8 Baiomys          21        6.67
 9 Chaetodipus       1       22.2 
10 Chaetodipus       2       25.1 
# ℹ 186 more rows

Now, to make this long data wide, we use pivot_wider() from tidyr to spread out the different taxa into columns. pivot_wider() takes 3 arguments: the data , the names_from column variable that will eventually become the column names, and the values_from column variable that will fill in the values. We’ll use a pipe so we don’t need to explicitly supply the data argument.

surveys_gw_wide <- surveys_gw %>% 
  pivot_wider(names_from = genus, values_from = mean_weight)

surveys_gw_wide

# A tibble: 24 × 11
   plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus
     <dbl>   <dbl>       <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
 1       1    7           22.2      60.2    156.      27.7        9.62
 2       2    6           25.1      55.7    169.      26.9        6.95
 3       3    8.61        24.6      52.0    158.      26.0        7.51
 4       5    7.75        18.0      51.1    190.      27.0        8.66
 5      18    9.5         26.8      61.4    149.      26.6        8.62
 6      19    9.53        26.4      43.3    120       23.8        8.09
 7      20    6           25.1      65.9    155.      25.2        8.14
 8      21    6.67        28.2      42.7    138.      24.6        9.19
 9       4   NA           23.0      57.5    164.      28.1        7.82
10       6   NA           24.9      58.6    180.      25.9        7.81
# ℹ 14 more rows
# ℹ 4 more variables: Peromyscus <dbl>, Reithrodontomys <dbl>, Sigmodon <dbl>,
#   Spermophilus <dbl>

Now we can go back to our original question: what is the relationship between mean weights of different genera across all plots? We can easily see the weights for each genus in each plot! Notice that some genera have NA values. That’s because some genera were not recorded in that plot.

5.5.4 Wide to long with `pivot_longer`

What if we had the opposite problem, and wanted to go from a wide to long format? For that, we can use pivot_longer() to gather a set of columns into one key-value pair. To go backwards from surveys_gw_wide, we should exclude plot_id.

pivot_longer() takes 4 arguments: the data, the names_to column variable that comes from the column names, the values_to column with the values, and cols which specifies which columns we want to keep or drop. Again, we will pipe from the dataset so we don’t have to specify the data argument:

surveys_gw_long <- surveys_gw_wide %>% 
  pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id)

surveys_gw_long

# A tibble: 240 × 3
   plot_id genus           mean_weight
     <dbl> <chr>                 <dbl>
 1       1 Baiomys                7   
 2       1 Chaetodipus           22.2 
 3       1 Dipodomys             60.2 
 4       1 Neotoma              156.  
 5       1 Onychomys             27.7 
 6       1 Perognathus            9.62
 7       1 Peromyscus            22.2 
 8       1 Reithrodontomys       11.4 
 9       1 Sigmodon              NA   
10       1 Spermophilus          NA   
# ℹ 230 more rows

If the columns are directly adjacent as they are here, we don’t even need to list the all out: we can just use the : operator, as before.

surveys_gw_wide %>% 
  pivot_longer(names_to = "genus", values_to = "mean_weight", cols = Baiomys:Spermophilus)

# A tibble: 240 × 3
   plot_id genus           mean_weight
     <dbl> <chr>                 <dbl>
 1       1 Baiomys                7   
 2       1 Chaetodipus           22.2 
 3       1 Dipodomys             60.2 
 4       1 Neotoma              156.  
 5       1 Onychomys             27.7 
 6       1 Perognathus            9.62
 7       1 Peromyscus            22.2 
 8       1 Reithrodontomys       11.4 
 9       1 Sigmodon              NA   
10       1 Spermophilus          NA   
# ℹ 230 more rows

Note that now the NA genera are included in the long format.

5.5.5 Challenge

Starting with the surveys_gw_wide dataset, how would you display a new dataset that gathers the mean weight of all the genera (excluding NAs) except for the genus Perognathus?

There are so many colors to chose from in R. Check out the R Color doc to find something that brings you joy.↩︎
The amount of control over various plot elements in ggplot is truly astonishing. Check out the complete list of themes here. Have fun!↩︎