13 Multivariate statistics

library(tidyverse)
library(rgl)
library(ggfortify)

options(rgl.useNULL = TRUE) 
rgl::setupKnitr(autoprint = TRUE)

13.1 Lesson preamble:

13.1.1 Lesson objectives

Be able to picture data as a multidimensional cloud

Gain familiarity with the terms, “trait space”, “morphospace”, “pairs plot”, “PC axes”, “cumulative variation explained”, “biplot”, and “scree plot”

Verbally describe the process behind a PCA

Interpret the output from a PCA

Identify when the output from a PCA is robust… or not

13.1.2 Lesson outline

Introduce Pigot et al. 2020 data and motivate class with a scientific question

Explore trait data in 2D and 3D while projecting example species into trait space

Introduce PCA

Run PCA

Interpret PCA

Challenge questions

13.2 Background information

Most things in the natural world are inter-related and correlated, where many trends and observations tend to change in a predictable way together.

We use multivariate techniques to reduce complex situations into more manageable ones and to deal with the fact that patterns are correlated with other patterns, which are in turn driven by a number of interacting ecological processes.

Multivariate statistics is a family of statistical tools used to to handle situations where the predictor and/or the response variables come in sets, allowing us to distill through some of that correlated (and sometimes redundant) information and to analyze these variables simultaneously. These tools are also useful for identifying substructure or groupings of your data.

Common biological questions that multivariate statistics can be used to answer:

Do sites with similar disturbance regimes (history of fire, nutrient fertilization, soil chemicals) host similar biological communities? To answer this question, one could use a PCA, NMDS, PCoA).
What is the effect of a set of environmental variables (temperature, salinity, current strength) on the composition of fish communities? To answer this question, one could use a RDA).
Do communities in different habitats–with different structural complexity and environments– have different functional representation in trait space? To answer this question, one could use a PCA, NMDS, PCoA).

13.3 Bird morphology data

Today we will be introducing the world of multivariate statistics by working with a published data set from (Pigot et al. 2020). In this paper, the authors are asking if morphology is linked to ecological function in birds. We might expect this to be true if species have evolved traits (in this case morphological traits) to optimize their fitness in a specific biotic and abiotic environment. In this way, traits can be used to link species to theories of community assembly, niche partitioning, and local adaptation.

Thus we might expect that birds that have occupy specialized niches–say the pollinators or the deep sea divers–might have also evolved specialized morphology in order to exploit these niches. If this is true, we might also see that phylogenetically unrelated species, that occupy similar niches, might have evolved similar traits via convergent evolution.

Today, we will be simplying the Pigot et al. (2020) analysis and to ask two questions among birds of the neotropics:

Do birds that are more closely related (eg. within order) tend to have similar trait values ie occupy the same regions of morphospace (indication of phylogenetic inertia)?
Do birds that occupy similar trophic niches have similar traits, as we might expect from trait-based theories of ecology and convergent evolution? While we will not be performing phylogenetic analyses in class, and so can’t explicitly test for the signal of convergent evolution, Pigot et al. (2020) do tackle this in the published paper.

We will be using the morphology data from Pigot et al. (2020), who measured 9 traits (8 morphology + mass) on museum specimens. These traits are previosly known to be important determinants of bird ecomorpholgy.

Extended Figure 1, Pigot et al. 2020. a, Resident frugivorous tropical passerine (fiery-capped manakin, Machaeropterus pyrocephalus) showing four beak measurements: (1) beak length measured from tip to skull along the culmen; (2) beak length measured from the tip to the anterior edge of the nares; (3) beak depth; (4) beak width. b, Insectivorous migratory temperate-zone passerine (redwing, Turdus iliacus) showing five body measurements: (5) tarsus length; (6) wing length from carpal joint to wingtip; (7) secondary length from carpal joint to tip of the outermost secondary; (8) Kipp’s distance, calculated as wing length minus first-secondary length; (9) tail length. Illustration by Richard Johnson.

# read in bird data
birds <- read_csv("data/Pigot2020.csv")

Rows: 9963 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): Binomial, Realm, TrophicLevel, TrophicNiche, ForagingNiche, Family...
dbl (14): MinLatitude, MaxLatitude, CentroidLatitude, CentroidLongitude, Ran...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# subset to neotropics so we have a more tractable dataset for class
birds <- subset(birds, Realm=="Neotropic")

# explore and look at columns
head(birds)

# A tibble: 6 × 23
  Binomial    Realm TrophicLevel TrophicNiche ForagingNiche Family Order Habitat
  <chr>       <chr> <chr>        <chr>        <chr>         <chr>  <chr> <chr>  
1 Abeillia_a… Neot… Herbivore    Nectarivore  Nectarivore … Troch… Apod… Forest 
2 Aburria_ab… Neot… Herbivore    Frugivore    Fugivore gle… Craci… Gall… Forest 
3 Acanthidop… Neot… Omnivore     Invertivore  Invertivore … Ember… Pass… Forest 
4 Accipiter_… Neot… Carnivore    Vertivore    Vertivore pe… Accip… Acci… Woodla…
5 Accipiter_… Neot… Carnivore    Vertivore    Vertivore pe… Accip… Acci… Woodla…
6 Accipiter_… Neot… Carnivore    Vertivore    Vertivore pe… Accip… Acci… Forest 
# ℹ 15 more variables: PrimaryLifestyle <chr>, MinLatitude <dbl>,
#   MaxLatitude <dbl>, CentroidLatitude <dbl>, CentroidLongitude <dbl>,
#   RangeSize <dbl>, Beak.Length_Culmen <dbl>, Beak.Length_Nares <dbl>,
#   Beak.Width <dbl>, Beak.Depth <dbl>, Tarsus.Length <dbl>, Wing.Length <dbl>,
#   Secondary1 <dbl>, Tail.Length <dbl>, Mass <dbl>

# need to log transform bc morphological data are often not normally distributed
par(mfrow=c(1,2))
hist(birds$Beak.Length_Culmen, main="Beak.length")
hist(birds$Mass, main="Mass")

par(mfrow=c(1,1))

# log transform and overwrite original columns
birds[, 15:23] <- log(birds[ , 15:23])
nrow(birds) # each row is a species

[1] 3566

13.4 Exploring trait space

We want to identify if different groupings (eg orders, trophic niche) of species have more or less similar traits.

How might we go about doing this?

Ideally, we would construct a 9 dimensional morphospace, where each dimension corresponds to one of our traits… but humans can’t visualize above 3 dimensions. So lets start by visualizing the relationships between our 9 traits, ie dimensions, in two dimensions using pairwise plots.

13.4.1 2D

We can look at the pairwise relationships between traits in bivariate plots to understand how different traits are correlated with each other. Remember from previous lectures that high multicolinearity can introduce problems in univariate statistics, we can see that here. across traits

# all highly correlated
# each point here is a species
par(mfrow=c(1,2))
plot(birds$Beak.Depth, birds$Beak.Width)
plot(birds$Wing.Length, birds$Mass)

par(mfrow=c(1,1))

# can look at all variables together
# can see that all morphological traits are highly linearly and positively correlated
# ie species with large masses ALSO have high wing length and high tarsus length (lower leg bone, the bone below what looks like a backwards facing knee)
pairs(birds[, 15:23])

# lets look at where some example species fall out on here
# Anthracothorax_mango =  hummingbird from Jamaica
# Harpia_harpyja = Harpy eagle 
# Junco_vulcani = Volcano Junko from costa rica and panama
cols <- ifelse(birds$Binomial== "Anthracothorax_mango", "darkorange" , 
               ifelse(birds$Binomial== "Harpia_harpyja", "darkblue", 
                      ifelse(birds$Binomial== "Junco_vulcani", "#009900", 
                             alpha("grey80", .01))))

pairs(birds[, 15:23], col=cols, pch=19)

What can we conclude from this figure about the relative morphological similarity/dissimilarity of the species?

13.4.2 3D

Now let’s expand this visualization to three dimensions–we still can’t capture the full 9 dimensions of our data in such a way that we can visualize it, but seeing the 3D plot can help us get a flavour of how our species are spread through morphospace.

Generally 3D plots are frowned upon as a data visualizing tool–they are complicated to interpret and there are usually simpler and clearer ways to show relationships between your variables in a manuscript, presentation, or talk. In this case, these 3D plots are a useful tool for illustrating the idea of higher dimensional data and the concepts of multivariate statistics.

# lets pick 3 dimensions or traits to visualize, beak.Width, Wing.Length and Mass
# rotate plot to see how each variable are related to eachother
plot3d(birds$Mass, birds$Beak.Width, birds$Wing.Length, col=alpha("grey80", .01))

Which dimensions have the tighter relationship? How can you tell?

Along which dimension is there the most variation?

Lets now add our 3 exemplar species to see how they fall in this morphospace.

plot3d(birds$Mass, birds$Beak.Width, birds$Wing.Length, col=alpha("grey80", .01))
points3d(subset(birds, Binomial=="Anthracothorax_mango" | Binomial=="Harpia_harpyja" |Binomial=="Junco_vulcani")$Mass, 
         subset(birds, Binomial=="Anthracothorax_mango" | Binomial=="Harpia_harpyja" |Binomial=="Junco_vulcani")$Beak.Width, 
         subset(birds, Binomial=="Anthracothorax_mango" | Binomial=="Harpia_harpyja" |Binomial=="Junco_vulcani")$Wing.Length, size=10, col=c("darkorange", "darkblue", "#009900"))

How would you describe where these 3 species fall out in morphospace? Where is the centre of morphospace?

How could we quantify this? Remember that we have 3 dimensions (traits) here, but really we’re interested in all 9. And it is not uncommon to have even more.

13.4.3 Euclidean distance

The solution is the Euclidean distance - which describes the closeness of points in a multidimensional space. If your data aren’t numeric and continuous, other notions of distance can be used in its place.

13.5 PCA: Principle Components Analysis

Rather than going through each set of traits and asking visually how they are correlated and where our species fall out across these highly correlated traits, we can run a PCA, which does all of this for us.

A PCA (Principle components analysis) is one of the most common ordination techniques used in EEB. They are really popular in the community ecology literature, the functional morphology literature, and population genetics. They can be used as an exploratory tool, intermediate step in an analysis, or the final product of an analysis. There are also lots of variations that deal with phylogenetic structure or data that violates the assumptions (eg categorical data).

PCAs work by identifying a new coordinate system (PC axes) that best account for the variation in a cloud of data across multiple dimensions. This process produces a set of “summary axes” that explain the variation in the data. Ideally, this has the effect of reducing the dimensionality of the data. In our example data in class, this would mean that we could describe the how our species are spread across trait space with less than 9 trait axes. This can be useful for summarizing information, understanding how traits are correlated with each other, provide simpler visualizations (ie in less dimensions), or can be necessary for downstream analyses.

What this means is the a PCA is simply a rotation of your data–it doesn’t change the relative relationships between species or how they are distributed through trait space. It simply provides new axes through which to look at the relationships between the data (eg. species) and the dimensions (eg. traits).

Importantly, a PCA assumes a linear relationship between your variables, can only take quantitative data, and requires more rows than columns in the data.

The procedure to conduct a PCA is as follows:

Imagine your data contains n objects and p variables. The n objects can be represented as a cluster of points in the p-dimensional space.

Imagine a 9-dimensional object with each of our trait measurements as an axis (k=9). We’ve already visualized this in 3 dimensions.

Plot our species (n=3566) in this 9D space such that we are representing our data as cloud in high dimension space. Identify the centre of trait space using euclidean distances.

The cluster of data in trait space is generally not spheroid: it is elongated in some directions and flattened in others (recall our bird data in 3 dimensions). This step fits a line through the data points in multidimensional space that goes through the centre of trait space and that explains the most variance in our data. This is the first PC axis.

Recall, in our 3D toy example, the major axis of variation is between mass and wing length. To actually calculate the math to add this line to the 3D figure, we would have to do the math behind a PCA or simply run the PCA and plot the resultant line… let’s do that now in order to visualize the process.

mean_xyz <- apply(birds[, c(23, 17, 20)], 2, mean)
xyz_pca   <- prcomp(birds[,c(23, 17, 20)], scale.=F)
dirVector1 <- xyz_pca$rotation[, 1]  # PC1
plot3d(birds$Mass, birds$Beak.Width, birds$Wing.Length, col=alpha("grey80", .01), size=1)
#abclines3d(mean_xyz, a = t(dirVector1), col="darkorange", lwd=2) # mean + t * direction_vector of PC1
lines3d(c(1,9),  
        c(.5,3.5), 
    c(3.5, 6.5),
        col="darkorange", lwd=8)

Find a second line through the data under the condition that it also goes through the centre of trait space and is orthogonal to the first axes (i.e., at a 90° angle). This means that these two axes are uncorrelated. This will be our second PC axis.

Visual representation of PC1 and PC2 in 2 dimensions.

Note how all principle components are perpendicular (orthogonal) to each other.

This process continues until we have as many PC axes as we did starting dimensions. So in the case of our bird trait data, that would be 9.

13.5.1 Run PCA

Now that we’ve conceptually visualized the process of what a PCA is doing, let’s run it in R.

We will use the function prcomp–there are many functions that calculate a PCA in R that all vary slightly, but generally give consistent results.

At its most basic, prcomp takes two variables, the trait data and scale = TRUE/FALSE. The scale command specifies whether the PCA will run on a correlation or a covariance matrix. Choosing scale = TRUE selects a correlation matrix which standardizes your data to have unit variance. This puts all your variables on the same scale and is necessary if your data contain multiple types of units (eg mass measured in grams and wing length measured in cm).

# just the trait data, scale = TRUE puts all trait values on unit variance, use when measure in multiple units, makes traits directly comparable
pca.out <- prcomp(birds[, 15:23], scale.=T)

13.5.2 Interpret PCA output

Now let us interpret the PCA object.

summary(pca.out)

Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     2.5795 1.0550 0.68273 0.62615 0.46602 0.26437 0.19886
Proportion of Variance 0.7393 0.1237 0.05179 0.04356 0.02413 0.00777 0.00439
Cumulative Proportion  0.7393 0.8630 0.91478 0.95834 0.98247 0.99023 0.99463
                          PC8     PC9
Standard deviation     0.1723 0.13658
Proportion of Variance 0.0033 0.00207
Cumulative Proportion  0.9979 1.00000

This gives us a summary in table format of variance explained by each axes. This tells us how good a job each PC axis is doing at explaining the variation across our data. Note the second row (Proportion of Variance) that tells us how well each individual PC axis is doing while the third row (Cumulative Proportion) simply adds the individual contributions of variance explained by each axis to tell us cumulatively how much variation we are explaining with the addition of each new PC axis. What we want to see here is that the first several axes are explaining the majority of the variation in the data. If the first PC axis is only explaining 5% of the variation in the data, that means that the PCA is not describing the relationships between your data very well and any interpretations of the output won’t be meaningful.

How much variation is explained cumulatively by the first 3 PC axes in our data?

Let’s have a look at this object in another way.

str(pca.out)

List of 5
 $ sdev    : num [1:9] 2.58 1.055 0.683 0.626 0.466 ...
 $ rotation: num [1:9, 1:9] 0.309 0.27 0.343 0.349 0.308 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:9] "Beak.Length_Culmen" "Beak.Length_Nares" "Beak.Width" "Beak.Depth" ...
  .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
 $ center  : Named num [1:9] 3.03 2.55 1.63 1.75 3.03 ...
  ..- attr(*, "names")= chr [1:9] "Beak.Length_Culmen" "Beak.Length_Nares" "Beak.Width" "Beak.Depth" ...
 $ scale   : Named num [1:9] 0.508 0.59 0.578 0.687 0.645 ...
  ..- attr(*, "names")= chr [1:9] "Beak.Length_Culmen" "Beak.Length_Nares" "Beak.Width" "Beak.Depth" ...
 $ x       : num [1:3566, 1:9] -5 5.15 -1.13 3.91 3.38 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
 - attr(*, "class")= chr "prcomp"

We see that this object is structured as a five-item list. We mostly care about two of these items:

Loadings – appropriately named rotation – these identify the PC axes (ie rotations) around which the data are organized.
Species scores – lazily named x – these are the coordinates in PC space for each of the species in our data and tell us where in PC space each species falls.

head(pca.out$x) # this gives us the particular location (ie the coordinates) of each of our species in the multidimensional space formed by the PC axes

           PC1       PC2        PC3        PC4         PC5         PC6
[1,] -4.996801 -1.172868  0.8622206 -0.1550225 -0.23109589  0.81003047
[2,]  5.150912  1.272725  1.5778845  0.5593605 -0.23613418 -0.12045828
[3,] -1.133455  0.200266 -0.1704813 -0.2639140  0.31336225  0.24707581
[4,]  3.914509  1.267083  0.5104558  0.4590751  0.13288949  0.06398866
[5,]  3.376036  1.472642  0.7392281  0.3893921 -0.05490466  0.23730346
[6,]  2.574368  1.615100  0.5074203  0.1862374 -0.17974063  0.20475730
             PC7        PC8         PC9
[1,] -0.08561238  0.4712667 -0.01224030
[2,]  0.16270838 -0.2759536  0.05778076
[3,] -0.15454788  0.0498536  0.03989184
[4,] -0.20406768  0.1652683  0.02280116
[5,]  0.05735235  0.2638578  0.12172863
[6,]  0.28287395  0.3047981  0.18268409

head(pca.out$rotation) # this tells us which of our traits  are contributing most strongly to each PC axis (ie which variables are being summarized by each newly created PC axis)

                         PC1         PC2          PC3        PC4         PC5
Beak.Length_Culmen 0.3092990 -0.53227660  0.188578600 -0.2181403  0.09158808
Beak.Length_Nares  0.2696209 -0.66835973  0.001967864 -0.1106845  0.15491394
Beak.Width         0.3426905  0.03999976 -0.610281688  0.1667968 -0.01501396
Beak.Depth         0.3490122  0.03462108 -0.564305388  0.1015702  0.05454900
Tarsus.Length      0.3076452  0.39691184  0.140452964 -0.5812005  0.47122204
Wing.Length        0.3589849  0.04897339  0.291811470  0.1331052 -0.60580349
                           PC6        PC7         PC8         PC9
Beak.Length_Culmen -0.11820103  0.1180164 -0.12558811 -0.69699967
Beak.Length_Nares   0.05223567 -0.1744716  0.03992508  0.63998724
Beak.Width         -0.66324179  0.1790129  0.08765516  0.02922599
Beak.Depth          0.68964479 -0.1426081  0.08479110 -0.20523806
Tarsus.Length      -0.04341410  0.0149411  0.40423045  0.05228672
Wing.Length        -0.07557215 -0.3876020  0.48903369 -0.06133691

We can interpret these outputs directly, specifically the rotations.

pca.out$rotation[ ,1:4] # let's just look examine the first 4 PC axes

                         PC1         PC2          PC3         PC4
Beak.Length_Culmen 0.3092990 -0.53227660  0.188578600 -0.21814032
Beak.Length_Nares  0.2696209 -0.66835973  0.001967864 -0.11068447
Beak.Width         0.3426905  0.03999976 -0.610281688  0.16679682
Beak.Depth         0.3490122  0.03462108 -0.564305388  0.10157019
Tarsus.Length      0.3076452  0.39691184  0.140452964 -0.58120050
Wing.Length        0.3589849  0.04897339  0.291811470  0.13310521
Secondary1         0.3626428  0.28972162  0.083292817 -0.07918361
Tail.Length        0.3192785  0.09456601  0.380946021  0.71437740
Mass               0.3680056  0.11983129  0.128885207 -0.17294488

The way we interpret this is by looking, within each PC axis, for the variables that have the largest absolute values. These are the variables that are describing the most variation along this axis.

For example, all the traits along PC1 are positively correlated with each other (all values have the same sign). PC1 is a axis where at large values of PC1 species have long wings, big masses, long tails etc. PC2 is slightly more interesting to interpret: it is a beak length axis vs tarsus length. This suggests that species that have long beaks tend to also have a short tarsus (leg bone).

Which traits drive variation along PC3?

We can visualize these relationships using the base R function biplot.

par(mar=c(5,5,5,5))
biplot(pca.out)

On this figure, there are 2 sets of information. Each species is plotted (these are the numbers). As well, the variable loadings are represented as arrows and tell us how each variable is contributing to each axis. The clustering of species along the axes (and rotations) tell us which species are similar in terms of their traits.

We can do this with the function autoplot in the ggfortify package, which takes advantage of ggplot style plotting.

autoplot(pca.out)

The simplest version of autoplot displays less information than our biplot function. On this figure we can see that we are plotting PC1 by PC2 and each point on the figure is each of the rows aka species in our data. autoplot handily also shows us the proportion of variance explained by each of these PC axes. Now lets add the loadings.

# lets add our 3 example species to see where they land in PC space
cols <- ifelse(birds$Binomial== "Anthracothorax_mango", "darkorange" , 
               ifelse(birds$Binomial== "Harpia_harpyja", "darkblue", 
                      ifelse(birds$Binomial== "Junco_vulcani", "#009900", 
                             alpha("grey30", .1))))

autoplot(pca.out, x=1,y=2,loadings=TRUE, loadings.label=TRUE, 
         loadings.colour="grey30", 
         colour=cols, size=3) + 
  theme_bw()

Based on where these species fall out in PC space, what can we conclude about the trait values of the hummingbird vs the junko vs the eagle?

The next step in the PCA workflow is to identify the important PC axes that give meaningful information about the data.

We do this with a scree plot. This shows us the amount of variation explained by each axis. As we know, each axis explains subsequently less variation. We use scree plots to help us determine at which point the axes stop adding much new information. We can do this visually or quantitatively. We will just go over the visual method as determining which axes to keep is pretty subjective and depends on your system and question of interest.

plot(pca.out)

Each bar is a PC axis, starting from PC 1. In a scree plot, what you want to look for is a big jump between subsequent axes. It’s subjective, but based on this scree plot I would probably be inclined to keep the first 2 PC axes. As well, we can look at the cumulative proportion of explained variance:

summary(pca.out)$importance[,1:6]

                            PC1      PC2      PC3       PC4       PC5       PC6
Standard deviation     2.579503 1.055004 0.682726 0.6261508 0.4660172 0.2643707
Proportion of Variance 0.739310 0.123670 0.051790 0.0435600 0.0241300 0.0077700
Cumulative Proportion  0.739310 0.862990 0.914780 0.9583400 0.9824700 0.9902300

We can see that by PC2 we have explained 86% of the variation in the data, which is pretty good. By the time we get to PC4, each individual axis is explaining less than 10% of variation in the data, so adding them, and conversely excluding them, isn’t going to have a large effect.

13.6 Challenge questions

13.6.1 Answer biological questions posed at start of class

Recall the way we started this lecture, we want to be able to answer if:

Do birds that are more closely related (eg. within order) tend to have similar trait values ie occupy the same regions of morphospace (indication of phylogenetic inertia)?
Do birds that occupy similar trophic niches have similar traits, as we might expect from trait-based theories of ecology and convergent evolution?

To answer these questions let’s return to our autoplot and add some convex hulls that define different groupings of our data. First, we will group the species by their higher level taxonomic structure (order) and then by ecology (trophic niche).

# answers first question
autoplot(pca.out, x=1,y=2,
         data=birds, 
         colour="Order",
         frame=TRUE) +
  theme_classic()

# answers second question
autoplot(pca.out, x=1,y=2,
         data=birds, 
         colour="TrophicNiche",
         frame=TRUE) +
  theme_classic()

What can we conclude from these figures?

13.6.2 PCA limitations and strengths

In class we’ve discussed some of the assumptions behind a PCA and how we interpret a PCA. Is there any type of data for which a PCA might not be appropriate? What are some examples of a biological research question for which a PCA might NOT be appropriate, even if the data is conceivably multidimensional? What are some of the drawbacks of using a PCA? When might you want to consider using a PCA vs not?

13.6.3 PCA in the literature

PCAs are really common across lots of disciplines and they are very common to encounter when reading papers. Let’s look at some real examples of published PCAs and interpret them. Are they robust? Do you trust the authors conclusions?

13.6.3.1 Pigot et al. 2020 example

What can we conclude from this figure?

Figure 6 from Pigot et al. 2020. Clustered points along each principal coordinate axis (PCoA) show the relative morphological similarity between trophic niches from different ecological theatres (biogeographical realms) based on the average pairwise distance between species (n=9,963 species) in nine-dimensional morphospace (see Methods). Individual trophic niches are omitted where they are absent or rare (<6 species) within particular biogeographic realms (Na, Nearctic; Ne, Neotropic; Pa, Palaearctic; Af, Afrotropic; Ma, Madagascar; In, Indo-Malaya; Oc, Oceania; Au, Australasia; An, Antarctic). Bird images reproduced with permission from HBW, Lynx Edicions.

13.6.3.2 Microplastics example

In this paper, the authors seek to identify if beaches along the Caribbean coast of Colombia vary in the quantity and type (fibers, pellets, fragments, or foam) of microplastic (MPs) pollution.

Group A - beaches in tolerable conditions with a “Moderate” MPs presence
Group B - beaches in bad conditions with a “High” MPs presence
Group C - beaches in extremely bad conditions due to “high” and “very high” MPs densities

Focusing on the bottom left panel, what do you think of the authors groupings into these 3 categories?

Figure 7 from Rangel-Buitrago et al. 2021. Groups by similarity levels between beaches and MP shapes presented by use of Agglomerative Hierarchical Clustering (AHC) and Principal Components Analysis (PCA). These analyses allow separation into three groups of beaches. The PCA scatterplot shows factors F1 and F2 as responsible for almost 81% of the total variance.

13.6.3.3 Population genetics example

The authors in this paper, “Getting genetic ancestry right for science and society”, argue that in a human genetic context, PCAs can do more harm than good when artificial categories are applied to continuous data. For example, they show a PCA plot of human genetic ancestry–defined by migration and high mixing–and attempt to impost discrete clusters of continental origin on the PC space.

Based on the figure, do you think that imposing categories on this data is reasonable? Why or why not? What does this suggest about when it might be (in)appropriate seek classifications of via a PCA?

Figure 1 from Lewis et al. 2023. The continuous, category free, nature of genetic variation. Colored dots (N=4149) are reference panel individuals from 87 populations representing ancestry from 7 continental or subcontinental regions projected onto the first two principal components of genetic similarity. Gray dots (N=31705) are participants from BioMe, a diverse biobank based in New York City. Clearly delineated continental ancestry categories, the islands of color, are shown to be a by-product of sampling strategy. They are not reflective of the diversity in this real-world dataset, made evident by the continuous sea of gray. Reproduced/modified from Olalde I et al., Nature 555, 190–196 (2018).

13.7 Further reading

Chapter 5 of Numerical Ecology with R by Borcard, D., Gillet, F. and Legendre, P. (2011).
PCA: a visual explanation
https://setosa.io/ev/principal-component-analysis/
PCA explained using a metaphor with wine, dinner, grandmothers, and animations
https://stirlingcodingclub.github.io/ordination/#wasps