To submit this assignment, upload the full document, including the original questions, your code, and the output. Submit your assignment as a knitted .pdf. Please ensure the text on your .pdf does not continue past the end of the page.
1. Simulating genomic mutations (2.5 marks)
a. Assume that as a strand of genomic DNA (or RNA, if we’re thinking about viruses!) is copied, mutations can be randomly introduced. There is a constant probability of mutation at each nucleotide (mutation_rate). What probability distribution best describes the distribution of the number of mutations you expect to to see in a given genome of fixed length? Why? (0.25 marks)
Answer:
b. What distribution best describes the distance (measured in number of nucleotides) between each subsequent mutation? Why? (0.25 marks)
Answer:
c. If the genome length is 1,000 and the mutation rate is 0.05, what is the average number of mutations you expect to see per genome? (Write code to answer this, creating variables genome_length and mutation_rate the input numbers, and assigning the answer to avg_num_mutations. (0.25)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
d. Draw a random number to represent the exact number of mutations observed in a particular genome, using the genome length and mutation rate above, and assign this to num_mutations. (0.25)
e. To choose (randomly) the precise positions in the genome where mutations will occur, first create a vector that containing a number indicating each genomic position (genome_indices). Then sample a random subset of size num_mutations from this vector. These will be the locations of the mutations on the genome. Finally, sort the positions of the mutations into ascending order, and call the final answer mutation_indices (0.5 marks)
f. Calculate the number of nucleotides between each mutation using the diff function, and call this inter_mutation_distance (0.25 marks)
g. Write a single code chunk to complete all of these steps (c,d,e,f) together and plot a histogram of inter_mutation_distance. Increase the genome length to 10^6 for your final results. Adjust your histogram so that each single nucleotide extra distance is its own bin. Describe the shape of the distribution that you see (0.75 marks)
2. Generating sampling distributions (1.5 pts)
We’ll continue with our mutation rate example!
a. Imagine you are doing an experiment to estimate the mutation rate and understand the distribution of mutations across the genome. Sequencing and assembling the entire genome is costly and complex, so you’ll just sequencing and analyze shorter sections. Assume that with your shorter sequences, you are going to analyze the location of just 50 mutations (n_muts_sample). Assume that inter-mutation distances in your shorter sequences follow the same distribution as those you measured above in the full genome. Write code to draw a sample of this size from the previously-measured inter_mutation_distances and call this sub_genomic_sample (0.5 marks)
b. You decide to generate and analyze 300 of these shorter sequences (n_samples). Write code (hint : use a for-loop) to generate each of these samples, calculate the average distance between mutations for each (avg_subgenomic_sample), and then plot a histogram of these means (0.5 marks)
c. Describe the resulting distribution. Compare it’s shape to that obtained in 1g. Are you surprised by the comparison? What statistical principle is at play here? (0.5 marks)
Answer:
3. Probability distributions and likelihood (1 mark)
When working with random numbers, it is often a good idea to use the function set.seed(). This function ensures that the random numbers that are generated are reproducible, including on other machines. Please read the documentation for this function and this blog post to learn more. (0 marks)
Simulate \(n = 1000\) realizations (i.e., draws) of a Geometric random variable with success probability \(p = 0.1\). Put these realizations in a vector x. Make a histogram of these realizations, using the default number of bins (0.25 marks)
Extract the first element of x, i.e., the first realization of the Geometric(\(p = 0.1\)) distribution. How likely is it that this observation arose from a Geometric distribution with probability \(p = 0.9\)? How likely is it that the observation arose from a Geometric distribution with probability \(p = 0.5\)? (0.25 marks)
What is the probability that each element of x arose due to a Geometric distribution with \(p = 0.5\)? Store these probabilities in a vector, apply log to each element of the vector, and print the sum the elements of the vector of log-transformed probabilities. This is the log-likelihood of the data — i.e., the realizations of the Geometric distribution you simulated in (a) — at \(p=0.5\). (0.5 marks)
4. Maximum likelihood estimation for binomial probabilities (3 marks)
For this problem, it will be useful to carefully review the likelihood calculation we did in Lecture 8 to estimate disease mortality rate from reports of cases and deaths, using data from Farrell & Davies, PNAS, 2019 that measured cases and deaths for a set of infectious diseases that each infect a wide range of animal species.
Load the Farrell & Davies 2019 dataset into R and call it disease_distance (0 marks)
Use the unique() function to determine all unique elements of the Host and ParaFamily columns. Then, use the expand.grid() function to generate a data frame with all combinations of hosts and parasite families. Store the data frame of host and parasite family combinations in an object called Combinations. Hint: look at the documentation for expand.grid(). (0.25 marks)
We are going to a function that takes in a host and parasite family (i.e., a specific row of the Combination data frame you have just made), and returns the maximum likelihood estimate for the probability of death, given infection, of that host being infected with members of the parasite family. In class, we did this just for the host species Sus scrofa (the wild boar) and parasite family Coronavirinae. Like we did in class, assume deaths are Binomial with a host-specific number of cases and probability of death, and those in different countries and years are independent. If there are no observations for a specific host and parasite family combination, your function should return NA; else, the function should return the value of \(p\) which maximizes the likelihood function formed from the data specific to the host and parasite family supplied as arguments to the function. You may also run into issues where the log-likelihood is negative infinity at all \(p\); in this case, the function should return NA.
The skeleton of the function is provided below. The input arguments specify the host species and the parasite family. The cases where an NA should be returned are addressed using nested if {} else {} statements.
Fill in the rest of the function! Test the function on the host and parasite family from class. Does it return the same maximum likelihood estimate for \(p\)? If not, the function needs to be de-bugged. If so, excellent – you are probably in a good position to finish the problem! (1.5 mark)
# make sure remove eval = FALSE for submission## skeleton of functionProbDeathEstimator <-function(host, para_family){ data <- disease_distance %>%filter(# FILL IN)if (nrow(data) ==0) {return(NA) # return NA if there are no observations for this host + para_family } else p_values_to_test <-# FILL IN # p values to search over# FILL IN: Generate a vector of log likelihood values for each p value LogLik <-data.frame(LogLik, p = p_values_to_test)if (all(LogLik$LogLik ==-Inf)){return(NA) } else {# FILL IN - Estimate the maximum likelihood p valuereturn(# FILL IN ) # return the maximum likelihood estimator for prob of death, given function}}
# make sure remove eval = FALSE for submissionProbDeathEstimator(host ="Sus_scrofa", para_family ="Coronavirinae")
Loop over all rows of the Combinations data frame you made in (b). For each row, apply the function you made in (c) to the host and parasite family in that row. Store the values returned by the function, for each row, in a vector ProbabilitiesDeath. At the end of the loop, this vector should have length equal to the number of rows in Combinations (i.e., one estimate for host-parasite family pair). Turn in into a dataframe data.frame(p = ProbabilitiesDeath, Combinations). Each row should have a host, parasite family, and the estimate probability of death for the host when infected with a member of the parasite family (0.75 mark)
Visualize the distribution of all probability of death estimates using a histogram. Then, visualize the distribution for by host (i.e., faceted by host) and by parasite family. (0.5 marks)
What do you notice about the distribution(s) of death probabilities?