Simulate data for a phylogenetic GLMM

David Wells 12/07/2020

Home page

Simiulate data for a phylogenetic generalised linear mixed model (GLMM) and fit that model to assess accuracy of estiamtes. A phylogenetic GLMM is a comparative method which accounts for similarity between related species. It is very similar to an "animal model" which uses a pedigree rather than a phylogeny to account for similarity between related individuals.

The phylogenetic GLMM fits a random effect which takes a separate value for each species but which follows a correlation structure based on the phylogeny. I.e. the random intercept is more closely correlated for more closely related species.

A lot of this borrows from chapter 6 in the MCMCglmm course notes (which is incomplete).

In [28]:
packages <- c("MCMCglmm", "dplyr", "ggplot2", "ggtree", "diosR", "ape", "MASS")
lapply(packages, require, character.only=T)
options(repr.plot.width=6, repr.plot.height=4)
  1. TRUE
  2. TRUE
  3. TRUE
  4. TRUE
  5. TRUE
  6. TRUE
  7. TRUE
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS/LAPACK: /home/david/miniconda3/lib/R/lib/

[1] C.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggtree_2.0.2     plot.matrix_1.4  MASS_7.3-51.5    diosR_0.0.0.9000
 [5] ggplot2_3.3.1    dplyr_1.0.0      MCMCglmm_2.29    ape_5.3         
 [9] coda_0.19-3      Matrix_1.2-18   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4          BiocManager_1.30.10 pillar_1.4.3       
 [4] compiler_3.6.1      base64enc_0.1-3     tools_3.6.1        
 [7] digest_0.6.25       uuid_0.1-4          tidytree_0.3.2     
[10] gtable_0.3.0        jsonlite_1.6.1      evaluate_0.14      
[13] lifecycle_0.2.0     tibble_3.0.1        nlme_3.1-145       
[16] lattice_0.20-40     pkgconfig_2.0.3     rlang_0.4.6        
[19] IRdisplay_0.7.0     rvcheck_0.1.8       IRkernel_0.8.15    
[22] parallel_3.6.1      treeio_1.10.0       withr_2.1.2        
[25] repr_1.1.0          generics_0.0.2      vctrs_0.3.1        
[28] grid_3.6.1          tidyselect_1.1.0    glue_1.3.2         
[31] R6_2.4.1            pbdZMQ_0.3-3        tensorA_0.36.1     
[34] tidyr_1.0.2         purrr_0.3.3         corpcor_1.6.9      
[37] magrittr_1.5        scales_1.0.0        htmltools_0.4.0    
[40] ellipsis_0.3.0      colorspace_1.4-1    cubature_2.0.4     
[43] lazyeval_0.2.2      munsell_0.5.0       crayon_1.3.4       

Simulate a random tree. The structure of the tree can be expressed as a relatedness matrix $A$. $A$ is a symetric matrix where each element $A_{ij}$ represents the branch length from root to split of species $i$ and $j$. This means that the diagonal is the distance of a taxa from the root. You can verrify this by comparing the table and tree below. However, this matrix is often scaled so that the distance from root to tip is 1. This is only possible for ultrametric trees so in this case we do not scale it. Plotted next to the tree in colour it is easy to see that this $A$ matrix ensure high covariance amongst closely related species.

In [105]:
n <- 5
tree<- rtree(n)

edgelabels(round(tree$edge.length,3), bg="white", frame="none", col="dimgrey")
In [106]:
Aphy <- vcv.phylo(tree, cor=F)
round(Aphy, 2)
A matrix: 5 × 5 of type dbl
In [107]:
    ggtree(tree) + geom_tiplab() + ggtitle("Tree and Aphy"),
) +
scale_fill_gradientn(colours=c("#185d75", "#49ebbe"))
# scale_fill_gradientn(colours=c("darkturquoise", "orange"))
Scale for 'fill' is already present. Adding another scale for 'fill', which
will replace the existing scale.

For our actual analysis we simulate a much larger tree. Then we use this $A$ matrix Aphy to enforce the desired corelation structure on to the random intercepts. The variance of this random effect is specified as phyV and the residual variance (of the overall model) as resV. The product of these variables gives the covariance structure that we sample the phylogenetic random effects phyfx from. The dataset sim is simulated with x as a fixed effect (for simplicity we use x as is which is equivalent to a coefficient of 1) and the random intercept drawn from phyfx.

In [4]:
n <- 1000
phyV <- 7
resV <- 2

tree<- rtree(n)
Aphy <- vcv.phylo(tree, cor=F)

phyfx <- mvrnorm(1, mu=rep(0, n), Sigma=Aphy * phyV)
In [5]:
sim <- data.frame(
    x = runif(n),
    phylo = paste0("t",1:n),
    y = NA
sim$y <- rnorm(n, sim$x + phyfx[as.character(sim$phylo)], sd=sqrt(resV))

Next we set the prior for the random effect G and the residual variance R. Then we're ready to run the model, again setting scale = F to allow a non-ultrametric tree. The random effect is fit as phylo which indicates the column of the dataset which matches rows up with nodes in the tree. We leave the thinning interval and number of iterations as defaults here but strongly recommend that you assess and adjust your model to ensure good mixing and convergence.

Finally we plot the distribution of the estimated phylogenetic variance and the residual variance with the true value used for simulations in red.

In [6]:
prior1 <- list(
        G1=list(V=1, nu=0.002)
    R=list(V=1, nu=0.002)

m1 <- MCMCglmm(y~x,
               ginverse=list(phylo=inverseA(tree, scale=F)$Ainv),
In [7]:
densplot(m1$VCV[,'phylo'], main="Phylogenetic variance")
abline(v=phyV, col="red")

densplot(m1$VCV[,'units'], main="Residual variance")
abline(v=resV, col="red")