Assignment 3: Advanced data wrangling and visualization (8 marks)

Download the .Rmd file here.

To submit this assignment, upload the full document, including the original questions, your code, and the output. Submit your assignment as a knitted .pdf. Please ensure the text on your .pdf does not continue past the end of the page.

Review: Before starting the assignment, review the data manipulation and plotting functions learned in class, such as read_csv, filter, select, mutate, arrange, head, group_by, summarize, tally, pivot_longer, pivot_wider, ggplot, geom_line, geom_point, labs, and theme.

1. Read in and pre-process plant biomass data (1.75 marks)

You will apply your data wrangling skills to the yearly change in biomass of plants in the beautiful Abisko National Park in northern Sweden. We have preprocessed this data and made it available as a csv file via this link. You can find the original data on Dryad, and the full text of the original study¹ in this pdf. Reading the study abstract will help you understand the data you are working with.

a. Import the data directly into R from the provided URL, assign it to a variable called plant_biomass, and display the first six rows. (0.25 marks)

b. Convert the Latin column names into their common English names: lingonberry, bilberry, bog bilberry, dwarf birch, crowberry, and wavy hair grass. After this, display all column names. (0.25 marks)

Hint: Read the documentation on the dplyr function rename. Search online to find out which Latin and English names pair up.
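As a minimal sketch of how rename works, the example below uses a made-up toy data frame. The column names (taraxacum_officinale, bellis_perennis) and values are invented for illustration and are not the columns of the assignment data.

```r
library(dplyr)
library(tibble)

# Toy data with hypothetical Latin column names (NOT the assignment's columns)
toy <- tibble(
  site = c("A", "B"),
  taraxacum_officinale = c(1.2, 0.8),
  bellis_perennis = c(0.4, 0.0)
)

# rename() takes new_name = old_name pairs; columns not mentioned are kept unchanged
toy_renamed <- toy %>%
  rename(
    dandelion = taraxacum_officinale,
    daisy = bellis_perennis
  )

colnames(toy_renamed)
```

Note that the new name goes on the left of the equals sign and the existing name on the right.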

c. This is a wide data frame (species make up the column names). A long format is easier to analyze, so gather the species names into one column (species) and the measurement values into another column (biomass). Keep all other columns. Assign the result to the variable plant_biomass_long, and show a preview of this new data frame. (0.5 marks)
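The wide-to-long reshaping described above can be sketched on a toy data frame. All names here (toy_wide, plant, cover) are made up for illustration; your own column selection and output names will differ.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy wide data: one column per (hypothetical) plant
toy_wide <- tibble(
  year = c(2001, 2002),
  dandelion = c(1.2, 0.8),
  daisy = c(0.4, 0.0)
)

# Gather the plant columns into a name column and a value column;
# columns not listed in `cols` (here: year) are kept and repeated as needed
toy_long <- toy_wide %>%
  pivot_longer(
    cols = c(dandelion, daisy),
    names_to = "plant",
    values_to = "cover"
  )

dim(toy_wide)  # 2 rows, 3 columns
dim(toy_long)  # 4 rows, 3 columns
toy_long
```

Comparing dim() before and after the pivot is a quick way to check that no values were lost.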

d. Describe how the dimensions of the data frame have changed between plant_biomass and plant_biomass_long. In this example, which format is more efficient in terms of the memory used to store the data? For the less efficient format, what is one potential benefit that might make it worth the extra storage space required? (0.25 marks)

e. Recreate the wide data frame (species names as columns again) by pivoting your plant_biomass_long data frame. Don’t overwrite your original plant_biomass variable! You don’t need to save this re-widened data frame, but if you do, give it a different name. (0.5 marks)
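For reference, here is a small sketch of the reverse operation on invented data. The data frame and column names (toy_long, plant, cover, toy_rewidened) are assumptions made for this example only.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy long data: one row per year and (hypothetical) plant
toy_long <- tibble(
  year = c(2001, 2001, 2002, 2002),
  plant = c("dandelion", "daisy", "dandelion", "daisy"),
  cover = c(1.2, 0.4, 0.8, 0.0)
)

# Spread the values of `plant` back out into separate columns;
# the result gets a new name so the earlier objects are left untouched
toy_rewidened <- toy_long %>%
  pivot_wider(names_from = plant, values_from = cover)

toy_rewidened
```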

2. Wrangling plant biomass with dplyr (3 marks)

Now that our data is in a tidy format, we can start exploring it!

a. What is the average biomass in g/m² for all observations in the study? (0.25 marks)

b. How does the average biomass compare between the grazed control sites and those that were protected from herbivores? (0.25 marks)

c. Display a table of the average plant biomass for each year. (0.25 marks)

d. What is the mean plant biomass per year for the grazedcontrol and rodentexclosure groups? Present the answer in a table that has these variables as column headers (use pivoting). (0.75 marks)
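One possible pattern for this kind of table is a grouped summary followed by a pivot. The sketch below uses made-up data; the names plot_type, fenced, open, and cover are hypothetical and stand in for whatever grouping and measurement columns you are working with.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy long data with a hypothetical two-level grouping column
toy <- tibble(
  year = c(2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002),
  plot_type = c("fenced", "fenced", "open", "open",
                "fenced", "fenced", "open", "open"),
  cover = c(1, 3, 2, 4, 5, 7, 6, 10)
)

toy_means <- toy %>%
  group_by(year, plot_type) %>%                              # one group per year and plot type
  summarize(mean_cover = mean(cover), .groups = "drop") %>%  # one mean per group
  pivot_wider(names_from = plot_type, values_from = mean_cover)  # groups become column headers

toy_means
```

The result has one row per year and one column per level of plot_type.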

e. Check whether there is an equal number of observations per site. (0.25 marks)

f. How many biomass measurements were 0? Which species had the most 0 biomass measurements? (0.5 marks)
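Counting rows that match a condition can be done in several ways. The sketch below shows one of them, filter followed by count, on invented data. The column names are hypothetical, and tally after group_by would work similarly.

```r
library(dplyr)
library(tibble)

# Toy data with some zero measurements
toy <- tibble(
  plant = c("dandelion", "dandelion", "daisy", "daisy", "daisy"),
  cover = c(0, 1.5, 0, 0, 2)
)

# Total number of zero measurements
n_zero <- toy %>%
  filter(cover == 0) %>%
  nrow()

# Zero measurements per plant, most frequent first
zero_counts <- toy %>%
  filter(cover == 0) %>%
  count(plant, sort = TRUE)

n_zero
zero_counts
```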

g. Create a new column called squared_biomass that contains the square of the biomass. Display the three largest squared_biomass observations in descending order. Only include the columns year, squared_biomass and species, and only observations between the years 2003 and 2008 from the forest habitat. (0.75 marks)

Hint: Break this down into single criteria and add one at a time. “Square” means taken to the power of 2. It does NOT mean square root (power of 1/2).
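To illustrate adding one criterion at a time, here is a sketch of a pipeline on toy data. All names (zone, plant, cover, cover_squared) are invented, and the year range is treated as inclusive, which is an assumption made for this example.

```r
library(dplyr)
library(tibble)

toy <- tibble(
  year = c(1999, 2000, 2001, 2002, 2003),
  zone = c("coast", "coast", "inland", "coast", "coast"),
  plant = c("daisy", "daisy", "dandelion", "dandelion", "daisy"),
  cover = c(5, 2, 9, 4, 3)
)

top_two <- toy %>%
  mutate(cover_squared = cover^2) %>%        # 1. new column: power of 2
  filter(zone == "coast") %>%                # 2. keep one zone
  filter(year >= 2000, year <= 2002) %>%     # 3. keep a year range (inclusive here)
  select(year, cover_squared, plant) %>%     # 4. keep only the columns of interest
  arrange(desc(cover_squared)) %>%           # 5. largest values first
  head(2)                                    # 6. keep the top rows

top_two
```

Running the pipeline after each added step makes it easy to spot which criterion behaves unexpectedly.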

3. Visualising plant biomass (3.25 marks)

a. Compare the mean biomass over time for grazedcontrol with that of rodentexclosure graphically in a line plot. What could explain the big dip in biomass in 2012? (0.5 marks)

Hint: The published study might help you answer the second question.

b. Compare the mean biomass over time for each species in a line plot. (0.5 marks)

c. We’ve found that the biomass is higher in the sites with rodent exclosures (especially in recent years), and that the crowberry is the dominant species. Notice how the lines for rodentexclosure and crowberry are of similar shape. Coincidence? Let’s find out! Use a faceted line plot to explore whether all plant species are impacted equally by grazing. (0.75 marks)
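A faceted line plot could look roughly like the sketch below, which uses already-summarized toy data. The data frame and its columns (toy_means, plot_type, plant, mean_cover) are made up for illustration. With real data you would typically compute the group means first.

```r
library(ggplot2)
library(tibble)

# Toy summary data: one mean per year, plot type, and (hypothetical) plant
toy_means <- tibble(
  year = rep(c(2001, 2002, 2003), times = 4),
  plot_type = rep(rep(c("fenced", "open"), each = 3), times = 2),
  plant = rep(c("dandelion", "daisy"), each = 6),
  mean_cover = c(1, 2, 3, 1, 1.5, 2, 0.5, 0.6, 0.4, 0.5, 0.3, 0.2)
)

# One line per plot type, one panel per plant
p <- ggplot(toy_means, aes(x = year, y = mean_cover, colour = plot_type)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ plant) +
  labs(x = "Year", y = "Mean cover", colour = "Plot type")

p
```

Mapping one variable to colour and another to the facets keeps each panel readable while still allowing comparison across panels.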

d. The habitat could also be affecting the biomass of different species. Explore this in a faceted line plot of the mean biomass over time. (0.5 marks)

e. Explore the relationship between species, habitat, and the distribution of biomass in a box plot. (0.5 marks)

f. It looks like both habitat and treatment have an effect on most of the species! Let’s dissect the data further by visualizing the effect of both the habitat and treatment on each species by faceting either a box plot or line plot accordingly. (0.5 marks)
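When two variables should both define panels, facet_grid is one option. The sketch below builds a small made-up data set (the names plant, zone, plot_type, and cover, and all values, are invented) and draws a box plot in a grid of panels.

```r
library(ggplot2)

# Toy data: every combination of plant, zone, and plot type, three replicates each
toy <- expand.grid(
  plant = c("dandelion", "daisy"),
  zone = c("coast", "inland"),
  plot_type = c("fenced", "open"),
  replicate = 1:3,
  stringsAsFactors = FALSE
)
# Arbitrary made-up measurement values, for illustration only
toy$cover <- seq_len(nrow(toy)) %% 5 + 1

# Rows of panels are defined by zone, columns of panels by plot type
p <- ggplot(toy, aes(x = plant, y = cover)) +
  geom_boxplot() +
  facet_grid(zone ~ plot_type) +
  labs(x = "Plant", y = "Cover")

p
```

Which variable goes on the x-axis and which go into the facet rows and columns is a design choice. Try a few arrangements and keep the one that makes the comparison of interest easiest to see.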


  1. Olofsson J, te Beest M, Ericson L (2013) Complex biotic interactions drive long-term vegetation dynamics in a subarctic ecosystem. Philosophical Transactions of the Royal Society B 368(1624): 20120486. https://dx.doi.org/10.1098/rstb.2012.0486