The latest stable version of SLOUCH can be installed from the CRAN (Comprehensive R Archive Network) by entering:
install.packages("slouch")
The phylogenetic trees used in SLOUCH are encoded as an object of class phylo
. Consult the package APE (Analysis of Phylogenetics and Evolution, Paradis et al. 2004) for the base functionality, and auxillary packages such as treeio
and ggtree
(Yu et al. 2016) for more modern and extensive functionality for importing, exporting or plotting phylogenetic trees in various formats. For the purposes of illustrating the software, we will use a dataset of ruminant neocortices bundled with the package and a corresponding phylogenetic tree (Toljagić et al. 2017). First, we will organize the neocortex data and associated annotation data.
# Load necessary packages
library(ape)
library(slouch)
## Load the phylogenetic tree with annotation data
data(artiodactyla)
phy <- artiodactyla
## Load the neocortex dataset
data(neocortex)
## Plot the tree
plot(ladderize(phy), cex = 0.6)
Now, we have a phylogenetic tree with corresponding morphological data for the extant species. If you use your own data to fit models, it is recommended to store the data for the terminal branches in a data frame or in a similar data structure. In order to line up the data frame with the tree, SLOUCH requires the species in the data frame need to be in a particular order.
## Check whether they are lined up correctly
neocortex$species == phy$tip.label
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
## [13] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Unsurprisingly, not all of the species are in their correct places; we will have to reorder the data frame. Here is one way to do it.
neocortex <- neocortex[match(phy$tip.label, neocortex$species), ]
## Check if they line up again
neocortex$species == phy$tip.label
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The neocortex dataset includes neocortex area, brain mass and body mass in ruminants, primarily bovids. It also includes some ecological information such as the type of habitat (open, closed) or mode of diet (grazer, browser), see ?neocortex
for further reference.
The idea here is to test whether the phylogenetic relationships have an influence on the distribution of a single variable. Most phylogenetic comparative methods will begin with this step. It is important to realize, however, that phylogenetic effects are not necessarily the same thing as phylogenetic inertia. A variable can be seen to have quite strong phylogenetic effects but such a pattern can easily come about if that variable is evolving towards optima associated with niches that themselves exhibit strong phylogenetic effects. Phylogenetic inertia needs to be measured from the residuals of a model that includes predictor variables that may or may not themselves be phylogenetically structured. First, we plot the neocortex-brain allometry:
braincentered <- neocortex$brain_mass_g_log_mean - mean(neocortex$brain_mass_g_log_mean)
plot(x = braincentered,
y = neocortex$neocortex_area_mm2_log_mean,
xlab = "Mean log brain mass (g)",
ylab = "Mean log neocortex area (mm2)")
The way to test for an overall phylogenetic effect in the SLOUCH program is to fit an intercept-only or grand mean model. The program will estimate the phylogenetic half-life \(t_{1/2}\) (\(t_{1/2} = \log(2)/\alpha\)), and the stationary variance \(v_y\) (\(= \sigma^2_y / 2 \alpha\)), using likelihood, and the intercept (\(b_0\)) using generalized least squares. For now we will use the numerical optimizer (the default setting):
model0 <- slouch.fit(phy = phy,
species = neocortex$species,
response = neocortex$neocortex_area_mm2_log_mean)
A minimal overview of model0
can be generated by typing print(model0)
. The output should be observed with caution until we trust that the hillclimber has converged at a global maximum, or by using a fine-grained grid search to accurately estimate \(t_{1/2}\) and \(v_y\).
print(model0)
## Response: neocortex$neocortex_area_mm2_log_mean
##
## Model fit:
## AICc Support R squared
## 7.055e+01 -3.197e+01 1.652e-16
##
## ML estimate(s):
## Phylogenetic half-life: 1648233.8309
## Stationary variance: 27047.2632
##
## Coefficients:
## (Intercept)
## 9.759
By entering summary(model0)
we get a more detailed summary of the model output. It displays the best estimates of all the parameters where support for the regression parameters are given as standard errors, and log-likelihood values as well as various information criteria for the best estimate model-fit.
summary(model0)
## Important - Always inspect the likelihood surface of the model parameters with
## grid search before evaluating model fit & results.
##
## Maximum-likelihood estimates
## Estimate
## Phylogenetic half-life 1648234
## Stationary variance 27047.26
##
## Inferred maximum-likelihood parameters
## Value
## Mean phylogenetic correction factor 5.717194e-06
## Rate of adaptation 4.205393e-07
## Diffusion variance 0.02274888
##
## Intercepts
## Estimates Std. error
## (Intercept) 9.759029 0.3679559
##
## Model fit summary
## Values
## Support -3.20e+01
## AIC 6.99e+01
## AICc 7.05e+01
## SIC 7.52e+01
## R squared 1.65e-16
## SST 4.30e+01
## SSE 4.30e+01
## N (params) 3.00e+00
The phylogenetic half-life parameter (\(t_{1/2} = \log(2) / \alpha\)) measures the influence of the ancestral state of the variable in question relative to the tendency to evolve towards the common ancestral state (the intercept). Conversely, \(\alpha\) measures the rate of adaptation. If the best estimate of \(t_{1/2}\) is 0, the ancestral state does not influence the current state of the variable. The larger \(t_{1/2}\) gets, the more influence the past state of the variable has on its current state (i.e. the trait‘s evolution approaches a Brownian motion as \(t_{1/2}\) approaches infinity). The units of the phylogenetic half-lives are the same units as the branch lengths in the phylogenetic tree, phy$edge.length
. The total depth, or distance from the root, can for all nodes be calculated with node.depth.edgelength(phy)
. For this phylogenetic tree the maximum tree depth is about 27 million years.
The parameters we estimate for the models that have a single random predictor variable are: \(t_{1/2}\), \(\sigma_x^2\), \(v_y\), and the regression parameters \(b_i\). Recall that the regression parameters \(b_i\) can be given in one of two ways, as an evolutionary regression or as an optimal regression where the latter is “corrected” by the phylogenetic correction factor. The predictor variance, \(\sigma_x^2\), is estimated a priori by SLOUCH. The estimation procedure itself is performed in a similar manner as for the intercept-only models above. For example, if we wanted to perform a regression of log neocortex size (\(\text{mm}^2\)) on log brain mass (g), we would enter:
model3 <- slouch.fit(phy = phy,
hl_values = 1,
vy_values = 0.05,
species = neocortex$species,
response = neocortex$neocortex_area_mm2_log_mean,
random.cov = braincentered)
model3
## Response: neocortex$neocortex_area_mm2_log_mean
##
## Model fit:
## AICc Support R squared
## -14.5016 11.7771 0.9145
##
## ML estimate(s):
## Phylogenetic half-life: 2.5219
## Stationary variance: 0
##
## Coefficients:
## (Intercept) braincentered
## 9.668 0.989
Comparative analyses based on species averages should consider the estimation error in these averages as measurement error. This is particularly pressing in fields such as evolutionary physiology, where the measurements of individual organisms may be laborious and expensive. Obtaining many measurements from many individuals from many species is difficult, and one often ends up with sample sizes that are small and uneven across species. In such a situation the variance attributable to measurement error can be a substantial fraction of the total, and one wants to weigh the species data according to their reliability. It is also possible that measurement variance may generate a downward bias in estimates of phylogenetic effects, because it makes species appear less statistically correlated than they are in reality. As discussed above, SLOUCH can incorporate measurement variance in both response and predictor variables.
For the neocortex data, estimates of measurement variance can be obtained as the square of the standard error of the species means. There is, however, a practical difficulty in that small sample sizes also makes for unreliable estimates of the measurement variance; the standard error of a species average obtained from a handful of individuals is so inaccurate as to be worthless. We therefore adopted the procedure of assuming that the within-species variance of each variable was the same for all species. The within-species variance estimated average of the sample variances of each variable was estimated as a sample-size-weighted average of the sample variances of each species; i.e. as
\[ \sigma^2_w = \frac{\sum_i \sigma_{wi}^2 (n_i - 1)}{\sum_i (n_i - 1)} \]
where \(\sigma_{wi}^2\) is the sample variance of species \(i\), and \(n_i\) is the sample size of species \(i\). In this way, the larger sample sizes are weighted more. We then estimated the measurement variance of each species as \(\sigma_w^2 / n_i\). See Grabowski et al. (2016) for further discussion. In order to incorporate measurement variance in the model, would enter:
model3 <- slouch.fit(phy = phy,
species = neocortex$species,
response = neocortex$neocortex_area_mm2_log_mean,
mv.response = neocortex$neocortex_se_squared,
random.cov = braincentered,
mv.random.cov = neocortex$brain_se_squared)
plot(x = braincentered,
y = neocortex$neocortex_area_mm2_log_mean,
xlab = "Mean log brain mass (g)",
ylab = "Mean log neocortex area (mm2)")
abline(model3$beta_evolutionary$coefficients_bias_corr[,1],
col = "black", lwd = 2)
abline(model3$beta_primary$coefficients_bias_corr[,1],
col = "orange", lwd = 2)
While the single-optimum model showed a strong phylogenetic signal, this model exhibits much less phylogenetic inertia, with best estimate of the phylogenetic half-life (\(t_{1/2}\)) being 1 myr. Here, the optimal regression is steeper than the evolutionary regression. It is also possible to fit a model with multiple continuous covariates, however the input to random.cov
must be a matrix or data frame that has column names, and the observational error passed to mv.random.cov
must be a matrix or data frame of the same shape as random.cov
.
bodycentered <- neocortex$body_mass_g_log_mean - mean(neocortex$body_mass_g_log_mean)
model4 <-
slouch.fit(phy = phy,
species = neocortex$species,
response = neocortex$neocortex_area_mm2_log_mean,
mv.response = neocortex$neocortex_se_squared,
random.cov = cbind(braincentered,
bodycentered),
mv.random.cov = cbind(neocortex$brain_se_squared,
neocortex$body_se_squared))
model4
## Response: neocortex$neocortex_area_mm2_log_mean
##
## Model fit:
## AICc Support R squared
## -13.3092 12.4654 0.9232
##
## ML estimate(s):
## Phylogenetic half-life: 2.0342
## Stationary variance: 0
##
## Coefficients:
## (Intercept) braincentered bodycentered
## 9.67005 0.90921 0.03934
The slouch.fit
function will on default estimate the intercept \(k\). If the phylogenetic tree is non-ultrametric, for example due to the inclusion of extinct species, it is possible to estimate the components of \(k\). Recall that, when \(y\) is evolving according to an Ornstein-Uhlenbeck process in response to one or more predictors \(x\) evolving as Brownian motions, the intercept \(k\) is
\[ k = e^{-\alpha t}y_a + (1 - e^{- \alpha t})b_0 + (1 - e^{-\alpha t} - \rho(\alpha t))(b_1x_{a1} + b_2x_{a2} + \dots) \]
SLOUCH can independently estimate \(y_a\), \(b_0\) and the sum \(bx_a = (b_1x_{a1} + b_2x_{a2} + \dots)\). Using the same example with neocortex evolving in response to brain size, we would specify:
model5 <- slouch.fit(phy = phy,
species = neocortex$species,
response = neocortex$neocortex_area_mm2_log_mean,
mv.response = neocortex$neocortex_se_squared,
random.cov = braincentered,
mv.random.cov = neocortex$brain_se_squared,
estimate.Ya = TRUE,
estimate.bXa = TRUE)
The parameters \(y_a\) and \(bx_a\) represent the ancestral states for \(y\) and \(x\) separate from the regression intercept \(b_0\). Since this phylogenetic tree is ultrametric, we cannot recover independent estimates of these. If we would try to execute the above code, we would not be able to estimate the GLS coefficients since the model matrix becomes singular. Even if we had a non-ultrametric tree, the intercept components are often estimated with extremely low power, so it can make sense to estimate them as a combined intercept term. This is done by default, or by specifying slouch.fit(..., estimate.Ya = FALSE, estimate.bXa = FALSE)
in the function call. The phylogenetic residual covariance matrix will always be computed based on the phylogenetic tree, whether it is ultrametric or not. In some cases (for example, when inertia is small), estimating the components of \(k\) will not work (due mainly to numerical issues because of unstable coefficient and parameter combinations in the intercept terms and non-convergence of regression parameters). Also note that, in the non-ultrametric case, each species theoretically has its own optimal intercept (\(b_0\)), however the reported estimate is actually an average of these. Its primary purpose is to allow us to plot a regression line.
SLOUCH can fit models with multiple adaptive regimes or niches over the branches of the phylogenetic tree. We will fit neocortex size as a function of diet in ruminants. Trees in the phylo
format are represented by the edges found in phy$edge
, where each edge connects two vertices or nodes. All of the tip nodes have indices starting from 1, 2, 3 … until \(n_{tips}\), in this case 43. The root node has index \(n_{tips}\)+1, here 44, and the rest of the internal nodes have indices (\(n_{tips}\)+2, \(n_{tips}\)+3, …, \(n_{nodes}\)). When running this type of model, we will need to specify the internal adaptive regimes in the order of node indices (\(n_{tips}\)+1, \(n_{tips}\)+2, \(n_{tips}\)+3, …, \(n_{nodes}\)). The regimes for the tips must be supplied to the fixed.fact
argument (slouch.fit(..., fixed.fact = neocortex$diet)
), and the regimes for the internal nodes must be assigned to phy$node.label
. In order to plot and visually verify that the ancestral state configuration is sensible, we need to have all the regimes in the order of the edges, not the nodes.
## Inspect the internal node regimes
## These have order n+1, n+2, n+3 ...
internal_regimes <- factor(phy$node.label)
## Concatenate tip and internal regimes. These will have order 1,2,3 ...
regimes <- c(neocortex$diet, internal_regimes)
## Pick out the regimes of the edges, in the order of phy$edge
edge_regimes <- factor(regimes[phy$edge[,2]])
plot(phy,
edge.color = c("Black", "Orange", "blue")[edge_regimes],
edge.width = 3, cex = 0.6)