4  Data wrangling!

4.1 Lesson preamble

4.1.1 Learning objectives

  • Understand the purpose of the dplyr package.
  • Learn to use data wrangling commands select, filter, %>%, and mutate from the dplyr package.
  • Understand the split-apply-combine concept for data analysis.
  • Use summarize, group_by, and tally to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.
  • Learn to switch between long and wide format

4.1.2 Lesson outline

  • R packages for data analyses (10 min)
  • Data wrangling in dplyr (40 min)
  • Split-apply-combine techniques in dplyr (25 min)
  • Using group_by and tally to summarize categorical data (20 mins)
  • Reshaping data (15 mins)

4.2 R packages for data analyses

R packages are basically bundles of functions that perform related tasks. There are many some that will be come with a base install of R since they are considered critical for using R, such as c(), mean(), +, -, etc.

There is an official repository for R-packages beyond the base packages called CRAN (Comprehensive R Archive Network). CRAN has thousands of packages, and all these cannot be installed by default, because then base R installation would be huge and most people would only be using a fraction of everything installed on their machine. It would be like if you downloaded the Firefox or Chrome browser and you would get all extensions and add-ons installed by default, or as if your phone came with every app ever made for it already installed when you bought it: quite impractical.

We can install new packages using the function install.packages(). You only need to do this once, so we’ll pass eval=FALSE to knitr at the top of our code chunk to make sure that the chunk won’t be evaluated when we knit the document. You can find other possible options to pass that can be helpful for formatting your output document.

While we’re looking at the {} section of our code chunk, we might note that it starts with “r”. This specifies that this chunk is written in R and you could tell RStudio to to instead interpret the code in the chunk as a different language like bash (command line) or python. You can also specify names for the chunks. The knitr options are probably the most useful part of this section, though!

install.packages('tidyverse')

tidyverse1 is a large collection of packages with similar functions, similar to the way Microsoft Word is part of Microsoft Office. tidyverse, as its name may suggest, contains many packages that makes data cleaning and exploring more intuitive and effective. It is basically an entire philosophy on how to handle data and has a massive following.

The two tidyverse packages we will be using the most frequently in this course is dplyr and ggplot2. dplyr is great for data wrangling (Lecture 2) and ggplot2 makes killer plots (Lecture 3).

To use functions in the dplyr package, type dplyr:: and then the function name.

dplyr::glimpse(cars) # `glimpse` is similar to `str`
Rows: 50
Columns: 2
$ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
$ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…
# cars is a built-in data set

Since we will be using this package a lot, it would be a little annoying to have to type dplyr:: every time. We can bypass this step by loading the package into our current environment. Think of this is “opening” the package for your work session.

# We could also do `library(dplyr)`, but we need the rest of the
# tidyverse packages later, so we might as well import the entire collection.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
glimpse(cars)
Rows: 50
Columns: 2
$ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
$ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…

This needs to be done once for every new R session, and so it is common practice to keep a list of all the packages used at the top of your script or notebook for convenience and load all of it at start up.

That’s a lot of red though! What are these warning signs and checks?

All the warning signs indicate are the version of R that they were built under. They can frequently be ignored unless your version of R is so old that the packages can no longer be run on R! Note that packages are frequently updated, and functions may become deprecated.

Next, the warning shows you all the packages that were successfully installed.

Finally, there are some conflicts! All this means is that there are multiple functions with the same name that may do different things. R prioritizes functions from certain packages over others. So, in this case, the filter() function from dplyr will take precedent over the filter() function from the stats package. If you want to use the latter, use double colons :: to indicate that you are calling a function from a certain package:

4.3 Data wrangling with dplyr

Wrangling here is used in the sense of maneuvering, managing, controlling, and turning your data upside down and inside out to look at it from different angles in order to understand it. The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++), this means that many operations run much faster than similar tools in R. An additional feature is the ability to work directly with data stored in an external database, such as SQL-databases. The ability to work with databases is great because you are able to work with much bigger datasets (100s of GB) than your computer could normally handle. We will not talk in detail about this in class, but there are great resources online to learn more (e.g. this lecture from Data Carpentry).

4.3.1 Selecting columns and filtering rows

We’re going to learn some of the most common dplyr functions: select(), filter(), mutate(), group_by(), and summarise(). To select columns of a data frame, use select(). The first argument to this function is the data frame (surveys), and the subsequent arguments are the columns to keep. Note that we don’t need quotation marks around the column names here like with did with base R. You do still need quotation marks around strings, though!

select(surveys, plot_id, species_id, weight, year) %>% head()
  plot_id species_id weight year
1       2         NL     NA 1977
2       2         NL     NA 1977
3       2         NL     NA 1977
4       2         NL     NA 1977
5       2         NL     NA 1977
6       2         NL     NA 1977
# head explained below, allows us to see first couple of rows of the data frame

To choose rows based on a specific criteria, use filter():

filter(surveys, year == 1995) %>% head()
  record_id month day year plot_id species_id sex hindfoot_length weight
1     22314     6   7 1995       2         NL   M              34     NA
2     22728     9  23 1995       2         NL   F              32    165
3     22899    10  28 1995       2         NL   F              32    171
4     23032    12   2 1995       2         NL   F              33     NA
5     22003     1  11 1995       2         DM   M              37     41
6     22042     2   4 1995       2         DM   F              36     45
      genus  species   taxa plot_type
1   Neotoma albigula Rodent   Control
2   Neotoma albigula Rodent   Control
3   Neotoma albigula Rodent   Control
4   Neotoma albigula Rodent   Control
5 Dipodomys merriami Rodent   Control
6 Dipodomys merriami Rodent   Control

4.3.1.1 An aside on conditionals

Note that to check for equality, R requires two equal signs (==). This is to prevent confusion with object assignment, since otherwise year = 1995 might be interpreted as ‘set the year parameter to 1995’, which is not what filter does!

Basic conditionals in R are broadly similar to how they’re already expressed mathematically:

2 < 3
[1] TRUE
5 > 9
[1] FALSE

However, there are a few idiosyncrasies to be mindful of for other conditionals:

2 != 3 # not equal
[1] TRUE
2 <= 3 # less than or equal to
[1] TRUE
5 >= 9 # greater than or equal to
[1] FALSE

Finally, the %in% operator is used to check for membership:

2 %in% c(2, 3, 4) # check whether 2 in c(2, 3, 4)
[1] TRUE

All of the above conditionals are compatible with filter, with the key difference being that filter expects column names as part of conditional statements instead of individual numbers.

4.3.2 Chaining functions together using pipes

But what if you wanted to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes. With intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects:

temp_df <- select(surveys, plot_id, species_id, weight, year)
filter(temp_df, year == 1995) %>% head()
  plot_id species_id weight year
1       2         NL     NA 1995
2       2         NL    165 1995
3       2         NL    171 1995
4       2         NL     NA 1995
5       2         DM     41 1995
6       2         DM     45 1995

You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested as things are evaluated from the inside out. Readability can be mildly improved by enabling “rainbow parentheses” (open settings > Code > Display and check rainbow parentheses), but it’s still basically impossible to document and effectively convey your work with this method.

filter(select(surveys, plot_id, species_id, weight, year), year == 1995) %>% head()
  plot_id species_id weight year
1       2         NL     NA 1995
2       2         NL    165 1995
3       2         NL    171 1995
4       2         NL     NA 1995
5       2         DM     41 1995
6       2         DM     45 1995

The last option, pipes, are a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package that also is included in the tidyverse. If you use RStudio, you can type the pipe with Ctrl/Cmd + Shift + M.

surveys %>% 
    select(., plot_id, species_id, weight, year) %>% 
    filter(., year == 1995) %>% head()
  plot_id species_id weight year
1       2         NL     NA 1995
2       2         NL    165 1995
3       2         NL    171 1995
4       2         NL     NA 1995
5       2         DM     41 1995
6       2         DM     45 1995

The . refers to the object that is passed from the previous line. In this example, the data frame surveys is passed to the . in the select() statement. Then, the modified data frame which is the result of the select() operation, is passed to the . in the filter() statement. Put more simply: whatever was the result from the line above the current line, will be used in the current line.

Since it gets a bit tedious to write out all the dots, dplyr allows for them to be omitted. By default, the pipe will pass its input to the first argument of the right hand side function; in dplyr, the first argument is always a data frame. The chunk below gives the same output as the one above:

surveys %>% 
    select(plot_id, species_id, weight, year) %>% 
    filter(year == 1995) %>% 
  head()
  plot_id species_id weight year
1       2         NL     NA 1995
2       2         NL    165 1995
3       2         NL    171 1995
4       2         NL     NA 1995
5       2         DM     41 1995
6       2         DM     45 1995

Another example:

surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight) %>% head()
  species_id sex weight
1         PF   F      4
2         PF   F      4
3         PF   M      4
4         RM   F      4
5         RM   M      4
6         PF          4

In the above code, we use the pipe to send the surveys dataset first through filter() to keep rows where weight is less than 5, then through select() to keep only the species_id, sex, and weight columns. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include it as an argument to the filter() and select() functions anymore.

If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data. (Pipes work with non-dplyr functions, too, as long as either the dplyr or magrittr package is loaded).

surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight) %>% 
  head()
  species_id sex weight
1         PF   F      4
2         PF   F      4
3         PF   M      4
4         RM   F      4
5         RM   M      4
6         PF          4

If we wanted to create a new object with this smaller version of the data, we could do so by assigning it a new name:

surveys_sml <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

surveys_sml
   species_id sex weight
1          PF   F      4
2          PF   F      4
3          PF   M      4
4          RM   F      4
5          RM   M      4
6          PF          4
7          PP   M      4
8          RM   M      4
9          RM   M      4
10         RM   M      4
11         PF   M      4
12         PF   F      4
13         RM   M      4
14         RM   M      4
15         RM   F      4
16         RM   M      4
17         RM   M      4

Note that the final data frame is the leftmost part of this expression.

A single expression can also be used to filter for several criteria, either matching all criteria (&) or any criteria (|):

surveys %>% 
    filter(taxa == 'Rodent' & sex == 'F') %>% 
    select(sex, taxa) %>% head()
  sex   taxa
1   F Rodent
2   F Rodent
3   F Rodent
4   F Rodent
5   F Rodent
6   F Rodent
surveys %>% 
    filter(species == 'clarki' | species == 'leucophrys') %>% 
    select(species, taxa) %>% head()
     species    taxa
1 leucophrys    Bird
2     clarki Reptile
3 leucophrys    Bird

4.3.2.1 Challenge

Using pipes, subset the survey data to include individuals collected before 1995 and retain only the columns year, sex, and weight.

4.3.3 Creating new columns with mutate

Frequently, you’ll want to create new columns based on the values in existing columns. For instance, you might want to do unit conversions, or find the ratio of values in two columns. For this we’ll use mutate().

To create a new column of weight in kg:

surveys %>%
    mutate(weight_kg = weight / 1000) %>% head()
  record_id month day year plot_id species_id sex hindfoot_length weight
1         1     7  16 1977       2         NL   M              32     NA
2        72     8  19 1977       2         NL   M              31     NA
3       224     9  13 1977       2         NL                  NA     NA
4       266    10  16 1977       2         NL                  NA     NA
5       349    11  12 1977       2         NL                  NA     NA
6       363    11  12 1977       2         NL                  NA     NA
    genus  species   taxa plot_type weight_kg
1 Neotoma albigula Rodent   Control        NA
2 Neotoma albigula Rodent   Control        NA
3 Neotoma albigula Rodent   Control        NA
4 Neotoma albigula Rodent   Control        NA
5 Neotoma albigula Rodent   Control        NA
6 Neotoma albigula Rodent   Control        NA

You can also create a second new column based on the first new column within the same call of mutate():

surveys %>%
    mutate(weight_kg = weight / 1000,
           weight_kg2 = weight_kg * 2) %>% head()
  record_id month day year plot_id species_id sex hindfoot_length weight
1         1     7  16 1977       2         NL   M              32     NA
2        72     8  19 1977       2         NL   M              31     NA
3       224     9  13 1977       2         NL                  NA     NA
4       266    10  16 1977       2         NL                  NA     NA
5       349    11  12 1977       2         NL                  NA     NA
6       363    11  12 1977       2         NL                  NA     NA
    genus  species   taxa plot_type weight_kg weight_kg2
1 Neotoma albigula Rodent   Control        NA         NA
2 Neotoma albigula Rodent   Control        NA         NA
3 Neotoma albigula Rodent   Control        NA         NA
4 Neotoma albigula Rodent   Control        NA         NA
5 Neotoma albigula Rodent   Control        NA         NA
6 Neotoma albigula Rodent   Control        NA         NA

The first few rows of the output are full of NAs, so if we wanted to remove those we could insert a filter() in the chain:

surveys %>%
    filter(!is.na(weight)) %>%
    mutate(weight_kg = weight / 1000) %>% head()
  record_id month day year plot_id species_id sex hindfoot_length weight
1       588     2  18 1978       2         NL   M              NA    218
2       845     5   6 1978       2         NL   M              32    204
3       990     6   9 1978       2         NL   M              NA    200
4      1164     8   5 1978       2         NL   M              34    199
5      1261     9   4 1978       2         NL   M              32    197
6      1453    11   5 1978       2         NL   M              NA    218
    genus  species   taxa plot_type weight_kg
1 Neotoma albigula Rodent   Control     0.218
2 Neotoma albigula Rodent   Control     0.204
3 Neotoma albigula Rodent   Control     0.200
4 Neotoma albigula Rodent   Control     0.199
5 Neotoma albigula Rodent   Control     0.197
6 Neotoma albigula Rodent   Control     0.218

is.na() is a function that determines whether something is an NA. The ! symbol negates the result, so we’re asking for everything that is not an NA.

4.3.3.1 Challenge

Create a new data frame from the surveys data that meets the following criteria: contains only the species_id column and a new column called hindfoot_half containing values that are half the hindfoot_length values. In this hindfoot_half column, there are no NAs and all values are less than 30.

Hint: think about how the commands should be ordered to produce this data frame!

4.4 Split-apply-combine techniques in dplyr

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.

dplyr facilitates this workflow through the use of group_by() to split data and summarize(), which collapses each group into a single-row summary of that group. The arguments to group_by() are the column names that contain the categorical variables for which you want to calculate the summary statistics. Let’s view the mean weight by sex.

surveys %>%
    group_by(sex) %>%
    summarize(mean_weight = mean(weight))
# A tibble: 3 × 2
  sex   mean_weight
  <chr>       <dbl>
1 ""             NA
2 "F"            NA
3 "M"            NA

The mean weights become NA since there are individual observations that are NA. Let’s remove those observations.

surveys %>%
    filter(!is.na(weight)) %>%
    group_by(sex) %>%
    summarize(mean_weight = mean(weight))
# A tibble: 3 × 2
  sex   mean_weight
  <chr>       <dbl>
1 ""           64.7
2 "F"          42.2
3 "M"          43.0

There is one row here that is neither male nor female, these are observations where the animal escaped before the sex could not be determined. Let’s remove those as well.

surveys %>%
    filter(!is.na(weight) & !is.na(sex)) %>%
    group_by(sex) %>%
    summarize(mean_weight = mean(weight))
# A tibble: 3 × 2
  sex   mean_weight
  <chr>       <dbl>
1 ""           64.7
2 "F"          42.2
3 "M"          43.0

You can also group by multiple columns:

surveys %>%
    filter(!is.na(weight) & !is.na(sex)) %>%
    group_by(genus, sex) %>%
    summarize(mean_weight = mean(weight))
`summarise()` has grouped output by 'genus'. You can override using the
`.groups` argument.
# A tibble: 28 × 3
# Groups:   genus [10]
   genus       sex   mean_weight
   <chr>       <chr>       <dbl>
 1 Baiomys     "F"          9.16
 2 Baiomys     "M"          7.36
 3 Chaetodipus ""          19.8 
 4 Chaetodipus "F"         23.8 
 5 Chaetodipus "M"         24.7 
 6 Dipodomys   ""          81.4 
 7 Dipodomys   "F"         55.2 
 8 Dipodomys   "M"         56.2 
 9 Neotoma     ""         168.  
10 Neotoma     "F"        154.  
# ℹ 18 more rows

Since we will use the same filtered and grouped data frame in multiple code chunks below, we could assign this subset of the data to a new variable and use this variable in the subsequent code chunks instead of typing out the functions each time.

filtered_surveys <- surveys %>%
    filter(!is.na(weight) & !is.na(sex)) %>%
    group_by(genus, sex)

If you want to display more data, you can use the print() function at the end of your chain with the argument n specifying the number of rows to display.

filtered_surveys %>%
    summarize(mean_weight = mean(weight)) %>%
    print(n = 15) # Will change the knitted output, not the notebook
`summarise()` has grouped output by 'genus'. You can override using the
`.groups` argument.
# A tibble: 28 × 3
# Groups:   genus [10]
   genus       sex   mean_weight
   <chr>       <chr>       <dbl>
 1 Baiomys     "F"          9.16
 2 Baiomys     "M"          7.36
 3 Chaetodipus ""          19.8 
 4 Chaetodipus "F"         23.8 
 5 Chaetodipus "M"         24.7 
 6 Dipodomys   ""          81.4 
 7 Dipodomys   "F"         55.2 
 8 Dipodomys   "M"         56.2 
 9 Neotoma     ""         168.  
10 Neotoma     "F"        154.  
11 Neotoma     "M"        166.  
12 Onychomys   ""          23.4 
13 Onychomys   "F"         26.8 
14 Onychomys   "M"         26.2 
15 Perognathus ""           6   
# ℹ 13 more rows

Once the data are grouped, you can also summarize multiple variables at the same time. For instance, we could add a column indicating the minimum weight for each species for each sex:

filtered_surveys %>%
    summarize(mean_weight = mean(weight),
              min_weight = min(weight))
`summarise()` has grouped output by 'genus'. You can override using the
`.groups` argument.
# A tibble: 28 × 4
# Groups:   genus [10]
   genus       sex   mean_weight min_weight
   <chr>       <chr>       <dbl>      <int>
 1 Baiomys     "F"          9.16          6
 2 Baiomys     "M"          7.36          6
 3 Chaetodipus ""          19.8          10
 4 Chaetodipus "F"         23.8           5
 5 Chaetodipus "M"         24.7           4
 6 Dipodomys   ""          81.4          24
 7 Dipodomys   "F"         55.2          10
 8 Dipodomys   "M"         56.2          12
 9 Neotoma     ""         168.           83
10 Neotoma     "F"        154.           32
# ℹ 18 more rows

4.4.0.1 Challenge

  1. Use group_by() and summarize() to find the mean, min, and max hindfoot length for each species.

  2. What was the heaviest animal measured in each year? Return the columns year, genus, species, and weight.

4.4.1 Using tally to summarize categorical data

When working with data, it is also common to want to know the number of observations found for each factor or combination of factors. For this, dplyr provides tally(). For example, if we want to group by taxa and find the number of observations for each taxa, we would do:

surveys %>%
    group_by(taxa) %>%
    tally()
# A tibble: 4 × 2
  taxa        n
  <chr>   <int>
1 Bird      450
2 Rabbit     75
3 Reptile    14
4 Rodent  34247

We can also use tally() when grouping on multiple variables:

surveys %>%
    group_by(taxa, sex) %>%
    tally()
# A tibble: 6 × 3
# Groups:   taxa [4]
  taxa    sex       n
  <chr>   <chr> <int>
1 Bird    ""      450
2 Rabbit  ""       75
3 Reptile ""       14
4 Rodent  ""     1209
5 Rodent  "F"   15690
6 Rodent  "M"   17348

Here, tally() is the action applied to the groups created by group_by() and counts the total number of records for each category.

If there are many groups, tally() is not that useful on its own. For example, when we want to view the five most abundant species among the observations:

surveys %>%
    group_by(species) %>%
    tally()
# A tibble: 40 × 2
   species             n
   <chr>           <int>
 1 albigula         1252
 2 audubonii          75
 3 baileyi          2891
 4 bilineata         303
 5 brunneicapillus    50
 6 chlorurus          39
 7 clarki              1
 8 eremicus         1299
 9 flavus           1597
10 fulvescens         75
# ℹ 30 more rows

Since there are 40 rows in this output, we would like to order the table to display the most abundant species first. In dplyr, we say that we want to arrange() the data.

surveys %>%
    group_by(species) %>%
    tally() %>%
    arrange(n)
# A tibble: 40 × 2
   species          n
   <chr>        <int>
 1 clarki           1
 2 scutalatus       1
 3 tereticaudus     1
 4 tigris           1
 5 uniparens        1
 6 viridis          1
 7 leucophrys       2
 8 savannarum       2
 9 fuscus           5
10 undulatus        5
# ℹ 30 more rows

Still not that useful. Since we are interested in the most abundant species, we want to display those with the highest count first, in other words, we want to arrange the column n in descending order:

surveys %>%
    group_by(species) %>%
    tally() %>%
    arrange(desc(n)) %>%
    head(5)
# A tibble: 5 × 2
  species          n
  <chr>        <int>
1 merriami     10596
2 penicillatus  3123
3 ordii         3027
4 baileyi       2891
5 megalotis     2609

If we want to include more attributes about these species, we can include these in the call to group_by():

surveys %>%
    group_by(species, taxa, genus) %>%
    tally() %>%
    arrange(desc(n)) %>%
    head(5)
# A tibble: 5 × 4
# Groups:   species, taxa [5]
  species      taxa   genus               n
  <chr>        <chr>  <chr>           <int>
1 merriami     Rodent Dipodomys       10596
2 penicillatus Rodent Chaetodipus      3123
3 ordii        Rodent Dipodomys        3027
4 baileyi      Rodent Chaetodipus      2891
5 megalotis    Rodent Reithrodontomys  2609

Be careful not to include anything that would split the group into subgroups, such as sex, year etc.

4.4.1.1 Challenge

  1. How many individuals were caught in each plot_type surveyed?

  2. You saw above how to count the number of individuals of each sex using a combination of group_by() and tally(). How could you get the same result using group_by() and summarize()? Hint: see ?n.

4.5 Reshaping with pivot_wider and pivot_longer

4.5.1 Defining wide vs long data

The survey data presented here is almost in what we call a long format – every observation of every individual is its own row. This is an ideal format for data with a rich set of information per observation. It makes it difficult, however, to look at the relationships between measurements across plots/trials. For example, what is the relationship between mean weights of different genera across all plots?

To answer that question, we want each plot to have its own row, with each measurements in its own column. This is called a wide data format. For the surveys data as we have it right now, this is going to be one heck of a wide data frame! However, if we were to summarize data within plots and species, we can reduce the dataset and begin to look for some relationships we’d want to examine. We need to create a new table where each row is the values for a particular variable associated with each plot. In practical terms, this means the values in genus would become the names of column variables and the cells would contain the values of the mean weight observed on each plot by genus.

We can use the functions called pivot_wider() and pivot_longer() (these are newer replacements for spread() and gather(), which were the older functions). These can feel tricky to think through, but do not feel alone in this! Many others have squinted at their data, unsure exactly how to reshape it, so there are many guides and cheatsheets available to help!

4.5.2 Summary of long vs wide formats

Long format:

  • every column is a variable
    • first column(s) repeat
  • every row is an observation

Wide format:

  • each row is a measured thing
  • each column is an independent observation
    • first column does not repeat

4.5.3 Long to Wide with pivot_wider

Let’s start by using dplyr to create a data frame with the mean body weight of each genus by plot.

surveys_gw <- surveys %>%
    filter(!is.na(weight)) %>%
    group_by(genus, plot_id) %>%
    summarize(mean_weight = mean(weight))
`summarise()` has grouped output by 'genus'. You can override using the
`.groups` argument.
surveys_gw %>% head()
# A tibble: 6 × 3
# Groups:   genus [1]
  genus   plot_id mean_weight
  <chr>     <int>       <dbl>
1 Baiomys       1        7   
2 Baiomys       2        6   
3 Baiomys       3        8.61
4 Baiomys       5        7.75
5 Baiomys      18        9.5 
6 Baiomys      19        9.53

Now, to make this long data wide, we use pivot_wider() from tidyr to spread out the different taxa into columns. pivot_wider() takes 3 arguments: the data , the names_from column variable that will eventually become the column names, and the values_from column variable that will fill in the values. We’ll use a pipe so we don’t need to explicitly supply the data argument.

surveys_gw_wide <- surveys_gw %>% 
  pivot_wider(names_from = genus, values_from = mean_weight)

head(surveys_gw_wide)
# A tibble: 6 × 11
  plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus Peromyscus
    <int>   <dbl>       <dbl>     <dbl>   <dbl>     <dbl>       <dbl>      <dbl>
1       1    7           22.2      60.2    156.      27.7        9.62       22.2
2       2    6           25.1      55.7    169.      26.9        6.95       22.3
3       3    8.61        24.6      52.0    158.      26.0        7.51       21.4
4       5    7.75        18.0      51.1    190.      27.0        8.66       21.2
5      18    9.5         26.8      61.4    149.      26.6        8.62       21.4
6      19    9.53        26.4      43.3    120       23.8        8.09       20.8
# ℹ 3 more variables: Reithrodontomys <dbl>, Sigmodon <dbl>, Spermophilus <dbl>

Now we can go back to our original question: what is the relationship between mean weights of different genera across all plots? We can easily see the weights for each genus in each plot! Notice that some genera have NA values. That’s because some genera were not recorded in that plot.

You may have used spread()in the past, which also takes three arguments: the data, the key column (or column with identifying information), and the values column (the one with the numbers/values).

surveys_gw_wide0 <- surveys_gw %>%
  spread(key = genus, value = mean_weight) 

head(surveys_gw_wide0)
# A tibble: 6 × 11
  plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus Peromyscus
    <int>   <dbl>       <dbl>     <dbl>   <dbl>     <dbl>       <dbl>      <dbl>
1       1    7           22.2      60.2    156.      27.7        9.62       22.2
2       2    6           25.1      55.7    169.      26.9        6.95       22.3
3       3    8.61        24.6      52.0    158.      26.0        7.51       21.4
4       4   NA           23.0      57.5    164.      28.1        7.82       22.6
5       5    7.75        18.0      51.1    190.      27.0        8.66       21.2
6       6   NA           24.9      58.6    180.      25.9        7.81       21.8
# ℹ 3 more variables: Reithrodontomys <dbl>, Sigmodon <dbl>, Spermophilus <dbl>

4.5.4 Wide to long with gather and pivot_longer

What if we had the opposite problem, and wanted to go from a wide to long format? For that, we can use pivot_longer() to gather a set of columns into one key-value pair. To go backwards from surveys_gw_wide, we should exclude plot_id.

pivot_longer() takes 4 arguments: the data, the names_to column variable that comes from the column names, the values_to column with the values, and cols which specifies which columns we want to keep or drop. Again, we will pipe from the dataset so we don’t have to specify the data argument:

surveys_gw_long2 <- surveys_gw_wide %>% 
  pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id)

surveys_gw_long2
# A tibble: 240 × 3
   plot_id genus           mean_weight
     <int> <chr>                 <dbl>
 1       1 Baiomys                7   
 2       1 Chaetodipus           22.2 
 3       1 Dipodomys             60.2 
 4       1 Neotoma              156.  
 5       1 Onychomys             27.7 
 6       1 Perognathus            9.62
 7       1 Peromyscus            22.2 
 8       1 Reithrodontomys       11.4 
 9       1 Sigmodon              NA   
10       1 Spermophilus          NA   
# ℹ 230 more rows

If the columns are directly adjacent as they are here, we don’t even need to list the all out: we can just use the : operator, as before.

surveys_gw_wide %>% 
  pivot_longer(names_to = "genus", values_to = "mean_weight", cols = Baiomys:Sigmodon)
# A tibble: 216 × 4
   plot_id Spermophilus genus           mean_weight
     <int>        <dbl> <chr>                 <dbl>
 1       1           NA Baiomys                7   
 2       1           NA Chaetodipus           22.2 
 3       1           NA Dipodomys             60.2 
 4       1           NA Neotoma              156.  
 5       1           NA Onychomys             27.7 
 6       1           NA Perognathus            9.62
 7       1           NA Peromyscus            22.2 
 8       1           NA Reithrodontomys       11.4 
 9       1           NA Sigmodon              NA   
10       2           NA Baiomys                6   
# ℹ 206 more rows

Note that now the NA genera are included in the long format.

In the past, you may have used gather(). We give it the arguments of a new key and value column name, and then specify which columns we either want or do not want gathered up. So, togo backwards from surveys_gw_wide, and exclude plot_id from the gathering, we would do the following:

surveys_gw_long1 <- surveys_gw_wide0 %>%
  gather(genus, mean_weight, -plot_id) 

head(surveys_gw_long1)
# A tibble: 6 × 3
  plot_id genus   mean_weight
    <int> <chr>         <dbl>
1       1 Baiomys        7   
2       2 Baiomys        6   
3       3 Baiomys        8.61
4       4 Baiomys       NA   
5       5 Baiomys        7.75
6       6 Baiomys       NA   

4.5.4.1 Challenge

Starting with the surveys_gw_wide dataset, how would you display a new dataset that gathers the mean weight of all the genera (excluding NAs) except for the genus Perognathus?


  1. This course is focused on tidyverse functions, because that seems to be the trend these days. Although all of our teaching material is written in tidy lingo, it is mostly for the sake of consistency. In all honesty, tidy is pretty great, but some functions are more intuitive in base, so most people code in a mix of the two. If you learned base R elsewhere and prefer to use those functions instead, by all means, go ahead. The correct code is code that does what you want it to do.↩︎