7  Review using Farrell and Davies (2019) dataset

You can download the .Rmd file for this class here.

7.1 Introduction

The goal of this class is to review concepts from the first half of the course, and to become comfortable writing your own code to wrangle and explore data. It also provides an opportunity to discuss project ideas with your class-mates. (We would like you to form groups before the class on Thursday Sept. 26, which will be devoted to project work, or by the end of that class at the latest.) Since the 2nd half of the course requires familiarity with the basics of R, this is a good point to assess if there are programming principles or specific concepts which you need to brush up on.

Here is how the class will work: you will split up into groups for each of the questions below, briefly discuss your interests for the group project, and work on the questions collaboratively. The questions will guide you through an exploratory analysis of a nice dataset generated by Farrell and Davies. In their paper, Farrell and Davies seek to test the hypothesis that disease-induced mortality (or virulence of a pathogen) increases with the degree to which an infected host species is diverged from other mammalian hosts. Your answers will NOT be marked, but Mete and Zoë are happy to check your answers and to provide guidance if you become stuck. We have also provided some tips and tricks regarding best practices for troubleshooting your code, and encourage you to try to solve the problems by applying some of those practices.

7.2 Learning preamble

7.2.1 Learning objectives

  • Become more confortable writing code on your own
  • Review concepts from the first half of the course, including
    • Reading datasets into R
    • Navigating between directories, including absolute and relative file paths
    • Summarizing information about the structure of data
    • Using ggplot to visualize data
    • Common uses of dplyr functions: mutate, select, subset, summarize, group_by, pivot_longer, etc.
    • Exporting plots
  • Become more comfortable with troubleshooting code
  • Discuss project ideas with your class-mates!

7.2.2 Lesson outline:

  • Troubleshooting (5 min)
  • Intro to the data (5 min)
  • Exercise 1 (20 min)
  • Exercise 2 (15 min)
  • Exercise 3 (35 min)
  • Exercise 4 (15 min)
  • Exercise 5 (15 min)

7.3 Best practices for troubleshooting code

7.3.1 Tips

If you have an error message, look for the FIRST occurrence of an error in the output and Google that line.

Tips for Googling:

  • Include the name of the function that failed in your query
    • If you’re not sure what failed, look for function names in that first error line (names with () after them) or the line above it
  • Remove any variable names or values from your query
  • If you’re given line numbers, use them to double-check your script at that line location for syntax mistakes
    • If you got the error after running a whole code chunk, you can also try running your code line by line (with cmd+enter on Mac or ctrl+enter on PC) to pinpoint the source of the problem

If you’re not getting output you expect, test what you’re trying to do with toy data (make a tiny data frame or vector) and/or use your actual data but perform the operations one at a time.

Generally, ensure you understand any warning messages that printed by your code. You may decide to change your script to eliminate the message, but make that decision carefully. This could be affecting analyses you’re performing later in your script. Google is your friend here as well.

It is absolutely disheartening to spend a lot of time trying to fix something. We’re not here to dismiss that feeling. It’ll take a while to get good at recognizing what keywords will be important for your Google searches & being able to fully understand the Stack Overflow posts. Troubleshooting and debugging are very valuable skills, though, and it’s worth the effort!

7.3.2 Quick Examples

library(tdyverse)

output: Error in library(tdyverse) : there is no package called ‘tdyverse’

What would you search?


x = 7

if (x < 5) {
  print("x is less than 5")
else {
  print("x is greater than 5")
}
  
rm(x)

output: Error: unexpected 'else' in:" print("x is less than 5") else"

What would you search?


mean(iris$Species)

output: Warning: argument is not numeric or logical: returning NA

What would you search?

7.4 The dataset

Information regarding the columns of the dataset are below. Each row corresponds to a particular host-parasite combination, and includes information about the order of the host, type of parasite, number of documented host species per parasite, the number of reported cases and deaths in the focal host, if transmission of the parasite is vector-borne, the mean phylogenetic distance of the focal host species from all species known to host the parasite, and information regarding the times and locations of case and fatality count data.

  • Year – Year of report
  • Country – Country name
  • Disease – OIE reported disease name
  • Parasite – Parasite Latin binomial
  • Host – Latin binomial of host reported by World Organisation for Animal Health
  • Cases – Number of reported cases
  • Deaths – Number of reported deaths due to disease
  • Destroyed – Number of animals reported to have been destroyed
  • Slaughtered – Number of animals reported to have been slaughtered
  • HostOrder – Taxonomic order of host species
  • SR – The number of documented host species per parasite, also referred to as “host species richness”
  • EvoIso – Host evolutionary isolation (in millions of years); see Figure 1 and SI 1.3 for details.
  • Taxonomic Range – The largest taxonomic range among documented host species (1-6 representing parasites restricted to a single species, genus, family, order, class, or multiple classes, respectively)
  • ParaType – Gross parasite taxonomic group (arthropod, bacteria, fungi, helminth, protozoa, virus)
  • ParaFamily – Parasite taxonomic family
  • Vector – Binary variable (0/1) indicating whether parasite can be transmitted by an arthropod vector
  • EnviroRestingStage – Binary variable (0/1) indicating whether parasite has a resting stage capable of persisting for long periods of time in the environment (typically months to years)
  • AvianReservoir –Binary variable (0/1) indicating whether parasite has been documented to use avian species as reservoir hosts
  • Reproduction – Binary variable (0/1) indicating whether parasite can be vertically transmitted as a function of reproduction (either vertically transmitted, sexually transmitted, or passed from mother to offspring via ingestion of milk or colostrum
  • WOK_citations – number of Web of Knowledge citations per parasite
  • WOK_citations_noHuman – number of Web of Knowledge citations per parasite excluding those that reference humans
  • latitude – latitude of country in degrees
  • gdp – Gross Domestic Product of country (current US$); WDI code “NY.GDP.MKTP.CD”
  • gdp_pcap – Gross Domestic Product of country per capita (current US$); WDI code “NY.GDP.PCAP.CD”

7.5 Exercise 1 (20 min)

a. Use download.file() to download the disease_distance.csv dataset from the EEB313 website. Call the computer on your file disease_distance.csv.

b. Clear your environment and console.

c. Read the disease_distance.csv dataset into R in two ways. Name the data object disease_data.

  • Change your working directory to the location where the file is saved using setwd(), and provide read_csv() with the relative path to the data.
  • Provide the absolute file path to the Farrell and Davies data when you call read_csv().

d. View aspects of the data in the console.

  • Run str() on the data and interpret the output.
  • Print the first 100 rows of the data. Print the last 100 rows.
  • Assign to an object new_df a data frame with 1000 randomly selected rows. Do this using the tidy function slice_sample(). Hint: use the documentation for the slice() function found here.
  • Assign to an object new_df2 a data frame with every 10th row of the data. Do this using base R. Hint: use seq() to create a vector whose entries are 10, 20, 30, … and use this vector to extract the rows of interest.
  • Select columns Reproduction, Host, Cases, Country, and Deaths from the new_df2 data frame you just made.
  • Run class() on the following columns: Reproduction, Deaths, and Country. Which of these columns can be effectively treated — or, in other words, coerced — into a logical vector (i.e., a vector whose entries are either TRUE or FALSE)?

From here on out, use the FULL data frame (i.e., disease_data, where all columns and rows present). To ensure that you do not call new_df, run new_df <- NULL. Try to think of a few reasons why, in writing a program for a research project, one would write over an intermediate data frame or file.

7.6 Exercise 2 (15 min)

a. Remove any rows with missing data using the tidy function drop_na(). Documentation can be found here. Are there any missing rows? Why or why not?

b. A calculation and some coersion!

  • Use the function mutate to estimate the case fatality rate by dividing number of individuals that die of a disease by the number of confirmed cases. Return a data frame with one column, corresponding to the case fatality rates that were just calculated.
  • Coerce the case fatality rate data column into a numeric vector using as.numeric(). What function might you use to coerce something into a character vector? (R is nice in that the naming conventions are very natural…when you get the hang of them!)
  • What do you notice about the distribution of case fatality rates?
  • Try to identify conditions under which the case fatality rate would be a problematic measure of parasite virulence?

c. Use tidyverse to count the number of observations associated to each parasite. Do the same thing for each host, and then for each host-country combination.

d. Compute mean, median, max, and min of Cases for each host in each country. Then, filter the resulting data frame so that rows where the mean number is >5000 are kept. What countries are such that the mean number is >5000? Think about why some countries may have fewer reported cases (key word: reported).

7.7 Exercise 3 (35 min)

a. Histograms, histograms, histograms!

  • Make a histogram of the number of cases, and another for the number of deaths. What do you notice?
  • Make histograms of the numbers of the log-transformed cases and deaths. What do you notice? By log transformed, we mean the logarithm (base 10) of the variable. Are there any warnings which appear and, if so, why?
  • Use the function unique() to determine the unique host species represented in the dataset. How many are there?
  • Make a histogram of the log-transformed number of deaths, faceted by host. Play with the parameter bins. The default for this parameter is bins = 30. What happens if you set bins = 1 and bins = 100?
  • Make a histogram of the number of deaths over the number of cases by host order and parasite type. Hint: look at the documentation for facet_grid().
  • Interpret the histograms you just made. Do you notice anything interesting? (For example, think about what parasite types hosts that die in large proportions relative to the number of reported cases…)

b. Boxplots (and a bunch of other geoms…)

  • Explore the documentation for ggplot “geoms”, which can be found here here.
  • Make a boxplot with host on the x axis and the log-transformed number of deaths on the y axis. Call this plot p.
  • Explore the structure of the plot object p using the global environment and, after this, the str() function?
  • Try changing the theme. Classics include bw(), classic(), light(), and void(). Which ones do you like? (More on ggplot themes can be found here.) Also, check out the GitHub repo for the newly-developed Barbie and Oppenheimer themes here! You can use these themes by first installing the package ThemePark.
  • Use Google to detemrine what to add to the ggplot so the text on the x axis is rotated by 90 degrees (i.e., is readable).
  • Export the plot as a .png and then as a .pdf.

7.8 Exercise 4 (15 min)

a. Visualize the data to see if there is relationship between the case fatality rate, the number of deaths over the number of cases, for a host and evolutionary isolation from the other hosts?

Based on your visualization of the data, what you predict the relationship between these variables is? How would you test your understand and/or visualize the relationship in a different way? If constructing a statistical model to determine if there is relationship, are there variables in the dataset that you think would be important to control for?

After thinking about these questions, read the abstract and significance statement for the paper from which this dataset is from. Discuss with your group what what exploratory analysis of data can and cannot do, based on the information in the paper and what you noticed about your visualization.

7.9 Exercise 5 (15 min)

a. Check out this documentation for the package ggridges.

b. Install ggridges and make sure to load it using library() or require().

c. Play around with ggridges! Using the disease_distance dataset which you have been exploring this lecture, make a ridgeline plot with a continuous variable on the x axis and discrete variable on the y. If you like, try to customize it (e.g., with fun colors)!

d. Run the following code chunk :)

install.packages("praise")
library(praise)
for (i in 1:10) { print(praise()) }