Title: | Comparative Analyses of Lineage-Pair Traits |
---|---|
Description: | Facilitates the testing of causal relationships among lineage-pair traits in a phylogenetically informed context. Lineage-pair traits are characters that are defined for pairs of lineages instead of individual taxa. Examples include the strength of reproductive isolation, range overlap, competition coefficient, diet niche similarity, and relative hybrid fitness. Users supply a lineage-pair dataset and a phylogeny. 'phylopairs' calculates a covariance matrix for the pairwise-defined data and provides built-in models to test for relationships among variables while taking this covariance into account. Bayesian sampling is run through built-in 'Stan' programs via the 'rstan' package. The various models and methods that this package makes available are described in Anderson et al. (In Review), Coyne and Orr (1989) <doi:10.1111/j.1558-5646.1989.tb04233.x>, Fitzpatrick (2002) <doi:10.1111/j.0014-3820.2002.tb00860.x>, and Castillo (2007) <doi:10.1002/ece3.3093>. |
Authors: | Sean A. S. Anderson [aut, cre] |
Maintainer: | Sean A. S. Anderson <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.0 |
Built: | 2024-11-23 06:52:06 UTC |
Source: | CRAN |
phylopairs
provides tools for conducting comparative analyses of lineage-pair traits in a phylogenetically informed context. Comparative analyses of pairwise-defined traits like "strength of RI" or "diet niche overlap" collected for numerous pairs of related taxa can yield important insights to biologists, but it is important to recognize that such data are not independent. phylopairs
provides a function, taxapair.vcv()
, that calculates the expected covariance structure of a pairwise-defined trait given (1) a table of pairs, (2) a phylogeny, and (3) a chosen model (as described in Anderson et al. in review). This covariance structure can be used in any number of downstream analyses. phylopairs
provides tools for a few such analyses, including linear regression, linear mixed models, generalized least squares, and two forms of beta regression for use with bounded response variables. All analyses are conducted in a Bayesian framework using built-in Stan
software programs interfaced through the rstan
package.
Maintainer: Sean A. S. Anderson [email protected]
Stan Development Team (NA). RStan: the R interface to Stan. R package version 2.32.6. https://mc-stan.org Anderson et al. *in review*. The comparative analysis of lineage-pair data.
Fits one of two beta regression models to a dataset in a Bayesian
framework using Stan
software via the rstan
package. Response variable must be
bounded between 0 and 1. Users can fit either (1) a basic beta regression model or a
(2) beta mixed-effects model in which there are covarying random intercepts in the
linear predictor. For the latter, users must supply a covariance matrix. In both
models, users can choose one of four link functions. Users can alter parameters for
model-parameter prior distributions and Bayesian sampling settings. See details.
betareg.stan(des, y, model="beta.ols", link="logit", covmat=NULL, iter=2000, chains=4, coef.u=0, coef.sd=10, phi.shape=0.01, phi.rate=0.01, physig2.u=-1, physig2.sd=1, cores=4, ...)
betareg.stan(des, y, model="beta.ols", link="logit", covmat=NULL, iter=2000, chains=4, coef.u=0, coef.sd=10, phi.shape=0.01, phi.rate=0.01, physig2.u=-1, physig2.sd=1, cores=4, ...)
des |
A vector of predictor variable observations OR, in the case of multiple predictors, a matrix in which each column is a vector of observations of a given predictor. |
y |
One of "beta.ols" or "beta.mm" for a basic beta regression model or beta mixed model, respectively; defaults to "beta.ols". |
model |
A vector of response variable observations. |
link |
The link function to be used. Default is "logit". Other possible choices are "probit", "cloglog" (complementary log-log), and "loglog". |
covmat |
Covariance matrix for model residuals (a lineage-pair covariance matrix if analyzing lineage-pair data or a phylogenetic vcv matrix if analyzing bounded species data). |
iter |
Number of iterations to run on each chain; defaults to 2000 (more are often necessary). |
chains |
Number of Markov chains to run; defaults to 4. |
coef.u |
Mean of the Gaussian prior for each preditor variable coefficient; defaults to 0. |
coef.sd |
SD of the Gaussian prior for each preditor variable coefficient; defaults to 10. |
phi.shape |
Shape parameter for gamma prior of beta distribution's precision parameter (phi); defaults to 0.01. |
phi.rate |
Rate parameter for gamma prior of beta distribution's precision parameter (phi); defaults to 0.01. |
physig2.u |
Mean of the lognormal prior for the scale of the residual covariance; defaults to -1. |
physig2.sd |
SD of the lognormal prior for the scale of the residual covariance; defaults to 1. |
cores |
Number of cores to be used; defaults to 4 (one chain per core). |
... |
additional arguments passed to |
The two models that can be chosen differ in terms of their linear predictor on the scale of the link function. These predictors are
Simple beta regression: g(u) = XB
Beta mixed-effects regression: g(u) = XB + u, u~N(0, physig2*C)
, where C is a
lineage-pair covariance matrix (if analyzing lineage-pair data) or a
phylogenetic vcv matrix (if analyzing species data) and physig2 is the
scaling parameter for C.
NOTE: For model 2, the user must supply a covariance matrix.
Prior Distributions for Model Parameters: The underlying stan models assume the following prior distributions for model parameters.
Regression coefficients: Gaussian prior (users can set prior mean and sd).
Precision parameter phi
: gamma prior (users can set prior shape and rate of gamma).
physig
: lognormal prior (users can set prior mean and sd).
A list containing two elements: (1) the posterior distribution of model parameters, and (2) the log-likelihood of the posteriors for use in downstream analyses (e.g. the calculation of model fitting metrics like loo or waic)
Anderson, S. A. S., et al. In review. The comparative analysis of lineage-pair data.
## Example 1: Fit beta regression models with different link functions to independent data # Load a dataset of independent response observations simulated with a logit link function data(data5) # Note: data were simulated with Coef[1]=1 (intercept), Coef[2]=0.8 (slope), phi=5 # Run the betareg function result1 = betareg.stan(des=data5[,3], y=data5[,4], iter=1000, cores=2) # Fit the model again but this time use a probit link function result2 = betareg.stan(des=data5[,3], y=data5[,4], link="probit", iter=1000, cores=2) # Compare posterior parameter estimates result1[[1]] result2[[1]] ## Example 2: Fit beta regression models to a dataset with simulated non-independence # Load a dataset of non-independent response observations simulated with a logit link function data(data7) # Note: data were simulated with Coef[1]=1 (intercept), Coef[2]=0.8 (slope), phi=5 # Load the lineage-pair covariance matrix that arose from those simulations data(sim.cov.pairs) # Run the betareg function result1 = betareg.stan(des=data7[,3], y=data7[,4], model="beta.mm", cov=sim.cov.pairs, iter=1000, cores=2) # Fit the model again but this time without the covariance matrix result2 = betareg.stan(des=data7[,3], y=data7[,4], iter=1000, cores=2) # Fit the model a third time with the cov. matrix but now with a probit link function result3 = betareg.stan(des=data7[,3], y=data7[,4], model="beta.mm", cov=sim.cov.pairs, link="probit", iter=1000, cores=2) # Compare posterior parameter estimates result1[[1]] result2[[1]] result3[[1]] # Compare the fit of the three models via leave-on-out (loo) cross validation. loo1 = suppressWarnings(loo::loo(result1[[2]])) loo2 = suppressWarnings(loo::loo(result2[[2]])) loo3 = suppressWarnings(loo::loo(result3[[2]])) loo::loo_compare(loo1, loo2, loo3)
## Example 1: Fit beta regression models with different link functions to independent data # Load a dataset of independent response observations simulated with a logit link function data(data5) # Note: data were simulated with Coef[1]=1 (intercept), Coef[2]=0.8 (slope), phi=5 # Run the betareg function result1 = betareg.stan(des=data5[,3], y=data5[,4], iter=1000, cores=2) # Fit the model again but this time use a probit link function result2 = betareg.stan(des=data5[,3], y=data5[,4], link="probit", iter=1000, cores=2) # Compare posterior parameter estimates result1[[1]] result2[[1]] ## Example 2: Fit beta regression models to a dataset with simulated non-independence # Load a dataset of non-independent response observations simulated with a logit link function data(data7) # Note: data were simulated with Coef[1]=1 (intercept), Coef[2]=0.8 (slope), phi=5 # Load the lineage-pair covariance matrix that arose from those simulations data(sim.cov.pairs) # Run the betareg function result1 = betareg.stan(des=data7[,3], y=data7[,4], model="beta.mm", cov=sim.cov.pairs, iter=1000, cores=2) # Fit the model again but this time without the covariance matrix result2 = betareg.stan(des=data7[,3], y=data7[,4], iter=1000, cores=2) # Fit the model a third time with the cov. matrix but now with a probit link function result3 = betareg.stan(des=data7[,3], y=data7[,4], model="beta.mm", cov=sim.cov.pairs, link="probit", iter=1000, cores=2) # Compare posterior parameter estimates result1[[1]] result2[[1]] result3[[1]] # Compare the fit of the three models via leave-on-out (loo) cross validation. loo1 = suppressWarnings(loo::loo(result1[[2]])) loo2 = suppressWarnings(loo::loo(result2[[2]])) loo3 = suppressWarnings(loo::loo(result3[[2]])) loo::loo_compare(loo1, loo2, loo3)
Tets for validity of a covariance matrix based on four conditions: symmetry, diagonal dominance, positive definiteness, and positive variance.
covmat.check(mat)
covmat.check(mat)
mat |
A putative covariance matrix. |
A valid covariance matrix must be symmetric, diagonally dominant (largest values
in each row are on the diagonal), positive definite, and have positive variance.
covmat.check
takes a matrix as input and tests for these four conditions.
A data.frame containing logical "TRUE" or "FALSE" for each condition.
# Load sample covariance matrix data(sim.cov.pairs) # Test for validity covmat.check(sim.cov.pairs)
# Load sample covariance matrix data(sim.cov.pairs) # Test for validity covmat.check(sim.cov.pairs)
Fits one of three models of linear regression to a dataset in a Bayesian framework. Bayesian sampling is conducted in the Stan software via the 'rstan' package. Users supply vectors of observations and, unless an "ols" model is chosen, a covariance matrix. Users can alter parameters for model-parameter prior distributions and Bayesian sampling settings. See details.
linreg.stan(des, y, model="ols", covmat=NULL, iter=2000, chains=4, cores=4, coef.u=0, coef.sd=10, physig2.u=-1, physig2.sd=1, randsig2.loc=0, randsig2.sc=2.5, ...)
linreg.stan(des, y, model="ols", covmat=NULL, iter=2000, chains=4, cores=4, coef.u=0, coef.sd=10, physig2.u=-1, physig2.sd=1, randsig2.loc=0, randsig2.sc=2.5, ...)
des |
A vector of predictor variable observations OR, in the case of multiple predictors, a matrix in which each column is a vector of observations of a given predictor. Function will add a column of 1s to make this a design matrix whose first column corresponds to the model intercept (unless such a column already exists). |
y |
A vector of response variable observations. |
model |
Choice of "ols", "pgls", or "pgls.mm"; defaults to ols. |
covmat |
Covariance matrix for model residuals (a lineage-pair covariance matrix if analyzing lineage-pair data or a phylogenetic vcv matrix if analyzing species data). |
iter |
Number of iterations to run on each chain; defaults to 2000 (more are often necessary). |
chains |
Number of Markov chains to run; defaults to 4. |
cores |
Number of cores to use. |
coef.u |
Mean of the Gaussian prior for each preditor variable coefficient; defaults to 0. |
coef.sd |
SD of the Gaussian prior for each preditor variable coefficient; defaults to 10. |
physig2.u |
Mean of the prior distribution (lognormal) for the scale of the phylogenetic component of residual covariance; defaults to -1. |
physig2.sd |
SD of the prior distribution (lognormal) for the scale of the phylogenetic component of residual covariance; defaults to 1. |
randsig2.loc |
Location parameter of the prior distribution (cauchy) for the scale of the independent component of residual covariance ; defaults to 0. |
randsig2.sc |
Scale parameter of the prior distribution (cauchy) for the scale of the independent component of residual covariance ; defaults to 2.5. |
... |
additional arguments passed to |
The models that can be chosen are:
"ols": basic ordinary least-squares regression.
Y = XB + epsilon
; where epsilon~N(0,randsig2*I)
and randsig2 is the scaling parameter for identity matrix I.
"pgls": phylogenetic generalized least-squares regression in which residuals covary according to covariance matrix C. This is a lineage-pair covariance matrix (if analyzing lineage-pair data) or a phylogenetic vcv matrix (if analyzing species data).
Y = XB + epsilon
; where epsilon~N(0,physig2*C)
and physig2 is the scaling parameter for covariance matrix C.
"pgls.mm" a mixed-effects version of pgls in which the residuals have both uncorrelated and correlated components, where the latter are structured according to covariance matrix C.
Y = XB + epsilon + u
; where epsilon~N(0,randsig2*I)
and u~N(0,physig2*C)
.
NOTE: For models 2 and 3, the user must supply a covariance matrix.
Prior Distributions for Regression Model Parameters: The underlying stan models assume the following prior distributions for model parameters
Regression coefficients: Gaussian prior (users can set prior mean and sd).
physig2
: lognormal prior (users can set prior mean and sd).
randsig2
: cauchy prior (users can set location and scale parameters of prior).
A list containing two elements: (1) the posterior distribution of model parameters, and (2) the log-likelihood of the posteriors for use in downstream analyses (e.g. the calculation of model fitting metrics like loo or waic)
Anderson, S. A. S., et al. In review. The comparative analysis of lineage-pair data.
## Fit regression models with and without covariance matrix # Note: data were simulated with Coef[1]=1 (intercept), Coef[2]=0.8 (slope) # Load a data simulated with a non-independent response observations data(data3) # Also load the lineage-pair covariance matrix that arose from those simulations data(sim.cov.pairs) # Fit an OLS model result1 = linreg.stan(des=data3[,3], y=data3[,4], cores=2) # Fit an pgls model result2 = linreg.stan(des=data3[,3], y=data3[,4], model="pgls", covmat=sim.cov.pairs, cores=2) # Compare posterior parameter estimates result1[[1]] result2[[1]] # Compare the fit of the two models via loo and waic loo1 = loo::loo(result1[[2]]) loo2 = loo::loo(result2[[2]]) waic1 = loo::waic(result1[[2]]) waic2 = loo::waic(result2[[2]]) loo1 loo2 waic1 waic2 loo::loo_compare(loo1, loo2) loo::loo_compare(waic1, waic2) # Extend the comparison by fitting a pgls.mm model result3 = linreg.stan(des=data3[,3], y=data3[,4], model="pgls.mm", covmat=sim.cov.pairs, cores=2) # Compare posterior parameter estimates result1[[1]] result2[[1]] result3[[1]] # Compare the fit of the three models via loo and waic loo3 = loo::loo(result3[[2]]) waic3 = loo::waic(result3[[2]]) loo::loo_compare(loo1, loo2, loo3) loo::loo_compare(waic1, waic2, waic3)
## Fit regression models with and without covariance matrix # Note: data were simulated with Coef[1]=1 (intercept), Coef[2]=0.8 (slope) # Load a data simulated with a non-independent response observations data(data3) # Also load the lineage-pair covariance matrix that arose from those simulations data(sim.cov.pairs) # Fit an OLS model result1 = linreg.stan(des=data3[,3], y=data3[,4], cores=2) # Fit an pgls model result2 = linreg.stan(des=data3[,3], y=data3[,4], model="pgls", covmat=sim.cov.pairs, cores=2) # Compare posterior parameter estimates result1[[1]] result2[[1]] # Compare the fit of the two models via loo and waic loo1 = loo::loo(result1[[2]]) loo2 = loo::loo(result2[[2]]) waic1 = loo::waic(result1[[2]]) waic2 = loo::waic(result2[[2]]) loo1 loo2 waic1 waic2 loo::loo_compare(loo1, loo2) loo::loo_compare(waic1, waic2) # Extend the comparison by fitting a pgls.mm model result3 = linreg.stan(des=data3[,3], y=data3[,4], model="pgls.mm", covmat=sim.cov.pairs, cores=2) # Compare posterior parameter estimates result1[[1]] result2[[1]] result3[[1]] # Compare the fit of the three models via loo and waic loo3 = loo::loo(result3[[2]]) waic3 = loo::waic(result3[[2]]) loo::loo_compare(loo1, loo2, loo3) loo::loo_compare(waic1, waic2, waic3)
Calculate weighted or unweighted node-averaged values of a lineage-pair trait.
node.averager(dataset, tree, taxacolumns, varb, weighted=FALSE, prune=TRUE, av=TRUE)
node.averager(dataset, tree, taxacolumns, varb, weighted=FALSE, prune=TRUE, av=TRUE)
dataset |
A data.frame in which each row corresponds to a pair in the dataset. Must contain two columns of taxa names (one for each taxon in every pair) and at least one data column with the pairwise-defined trait that is to be averaged. Taxa names must be in same format as that used in the tree. |
tree |
An ultrametric phylogenetic tree ('phylo' object) containing the species that appear in at least one pair in the dataset. Names must be in the same format as those used in 'dataset'. |
taxacolumns |
Character vector containing the column names for the two columns containing species names (e.g. c("sp1", "sp2")) |
varb |
The variable to be averaged (e.g. "RI", "range_overlap", etc.) |
weighted |
Logical indicating whether weighted node averages are to be calculated; defaults to FALSE. |
prune |
Logical indicating whether tree should be pruned to contain just the species represented in the dataset; defaults to TRUE. |
av |
Logical indicating whether to average the values of multiple entries for the same pair, should they appear in the dataset; defaults to TRUE. If set to FALSE, function will stop in the case of more than one entry in the dataset corresponding to the same pair. |
node.averager()
takes a lineage-pair dataset and a phylogenetic tree and
returns the average value of the pairwise-defined trait at each node. It calculates,
at each node in the tree, the average value of a pairwise-defined trait for all pairs
whose species span the node.
The simple or 'unweighted' average is calculated as introduced by Coyne and Orr (1989). The 'weighted' node averaging procedure was introduced by Fitzpatrick (2002) and discussed in Fitzpatrick and Turelli (2006). In weighted averaging, the trait values for the pairs spanning a node are first halved K-1 times, where K is the number of nodes between the species in a pair. These halved values are then summed to get arrive at a weighted average for a node.
Note: For datasets containing many individual species, the best available
tree might be very large. By default, node.averager()
prunes the tree to
contain just the species represented in the dataset. Beware that pruning can
affect both the number of nodes and the node averaged values (for example by
altering the number of nodes between pairs when calculating weighted node averages).
Numeric vector of node averages, named according to node indices.
Coyne, J. A., Orr, H. A. 1989. Patterns of speciation in Drosophila. Evolution 43:362-381.
Fitzpatrick, B. M. 2002. Molecular correlates of reproductive isolation. Evolution 56:191-198.
Fitzpatrick, B. M., Turelli. 2006. The geography of mammalian speciation: mixed signals from phylogenies and range maps. Evolution 60:601-615.
# Load simulated dataset and tree data(data1) data(sim.tree1) # Perform node averaging unwtd <- node.averager(dataset = data1, tree = sim.tree1, varb = "pred", taxacolumns = c("sp1", "sp2")) wtd <- node.averager(dataset = data1, tree = sim.tree1, varb = "pred", taxacolumns = c("sp1", "sp2"), weighted = TRUE) # Compare outcomes of weighted and unweighted node averaging unwtd wtd summary(unwtd) summary(wtd) # Calculate data loss nrow(data1) - length(wtd) nrow(data1) - length(unwtd) # Plot tree and node labels library(ape) plot(sim.tree1) nodelabels()
# Load simulated dataset and tree data(data1) data(sim.tree1) # Perform node averaging unwtd <- node.averager(dataset = data1, tree = sim.tree1, varb = "pred", taxacolumns = c("sp1", "sp2")) wtd <- node.averager(dataset = data1, tree = sim.tree1, varb = "pred", taxacolumns = c("sp1", "sp2"), weighted = TRUE) # Compare outcomes of weighted and unweighted node averaging unwtd wtd summary(unwtd) summary(wtd) # Calculate data loss nrow(data1) - length(wtd) nrow(data1) - length(unwtd) # Plot tree and node labels library(ape) plot(sim.tree1) nodelabels()
Takes a lineage-pair dataset and a phylogenetic tree and returns the 'age' of the node representing the MRCA for each pair. 'Age' is measured from the present and is in whatever units are represented by branch lengths of the tree (which is time in a dated phylogeny).
Takes a lineage-pair dataset and a phylogenetic tree and returns the 'age' of the node representing the MRCA for each pair. 'Age' is measured from the present and is in whatever units are represented by branch lengths of the tree (which is time in a dated phylogeny).
pair.age.in.tree(dataset, tree, taxacolumns) pair.age.in.tree(dataset, tree, taxacolumns)
pair.age.in.tree(dataset, tree, taxacolumns) pair.age.in.tree(dataset, tree, taxacolumns)
dataset |
A data.frame in which each row corresponds to a pair in the dataset. Must contain two columns of taxa names (one for each taxon in every pair), and the taxa names must be in same format as that used in the tree. |
tree |
An ultrametric phylogenetic tree ('phylo' object) containing the species that appear in at least one pair in the dataset. Names must be in the same format as those used in 'dataset'. |
taxacolumns |
Character vector containing the column names for the two columns containing species names (e.g. c("sp1", "sp2")) |
Function is a wrapper for ape::node.depth.edgelength
. That function returns the depths of all the nodes in a tree as measured from the root. phylopairs::node.age
converts these values to depth as measured from the present and returns the values corresponding to each pair in a user-supplied lineage-pair dataset.
Function is a wrapper for ape::node.depth.edgelength()
. That function returns the depths of all the nodes in a tree as measured from the root. phylopairs::node.age()
converts these values to depth as measured from the present and returns the values corresponding to each pair in a user-supplied lineage-pair dataset.
A numeric vector of node ages ordered to match the rows in 'dataset'.
A numeric vector of node ages ordered to match the rows in 'dataset'.
# Load a dataset and a tree data(data1) data(sim.tree1) # Find the node.ages and add as a column to the dataset data1$age = pair.age.in.tree(dataset=data1, tree=sim.tree1, taxacolumns=c("sp1","sp2")) head(data1) # Plot tree with axis in units of branch lengths and perform visual check that dates are correct library(ape) plot(sim.tree1) axisPhylo() # Load a dataset and a tree data(data1) data(sim.tree1) # Find the node.ages and add as a column to the dataset data1$age = pair.age.in.tree(dataset=data1, tree=sim.tree1, taxacolumns=c("sp1","sp2")) head(data1) # Plot tree with axis in units of branch lengths and perform visual check that dates are correct library(ape) plot(sim.tree1) axisPhylo()
# Load a dataset and a tree data(data1) data(sim.tree1) # Find the node.ages and add as a column to the dataset data1$age = pair.age.in.tree(dataset=data1, tree=sim.tree1, taxacolumns=c("sp1","sp2")) head(data1) # Plot tree with axis in units of branch lengths and perform visual check that dates are correct library(ape) plot(sim.tree1) axisPhylo() # Load a dataset and a tree data(data1) data(sim.tree1) # Find the node.ages and add as a column to the dataset data1$age = pair.age.in.tree(dataset=data1, tree=sim.tree1, taxacolumns=c("sp1","sp2")) head(data1) # Plot tree with axis in units of branch lengths and perform visual check that dates are correct library(ape) plot(sim.tree1) axisPhylo()
Simulated datasets are provided to help users understand functions. Lineage-pairs (n=190) were created from a simulated phylogenetic tree with 20 tips. A predictor variable was generated from random draws of a standard normal. Reponse variables of various structures were simulated as follows:
Simulated Dataset 1 Unbounded response, linear relationship between response and predictor, no covariance in residuals
Simulated Dataset 2 Unbounded response, no relationship between response and predictor, no covariance in residuals
Simulated Dataset 3 Unbounded response, linear relationship between response and predictor, covariance in residuals
Simulated Dataset 4 Unbounded response, no relationship between response and predictor, covariance in residuals
Simulated Dataset 5 Bounded response, linear relationship between response and predictor on link scale, no covariance in residuals
Simulated Dataset 6 Bounded response, no relationship between response and predictor on link scale, no covariance in residuals
Simulated Dataset 7 Bounded response, linear relationship between response and predictor on link scale, covariance in residuals
Simulated Dataset 8 Bounded response, no relationship between response and predictor on link scale, covariance in residuals
Simulated Lineage-Pair Covariance Matrix A 190*190 covariance matrix used in simulating the example datasets in this package.
Simulated Phylogenetic Tree A 20-species ultrametric phylogenetic tree used in simulating the example datasets in this package.
The datasets are as follows
A data frame with 190 rows and 4 columns.
Identity of Species 1 in Pair
Identity of Species 2 in Pair
Numeric predictor variable
Numeric response variable
A data frame with 190 rows and 4 columns.
Identity of Species 1 in Pair
Identity of Species 2 in Pair
Numeric predictor variable
Numeric response variable
A data frame with 190 rows and 4 columns.
Identity of Species 1 in Pair
Identity of Species 2 in Pair
Numeric predictor variable
Numeric response variable
A data frame with 190 rows and 4 columns.
Identity of Species 1 in Pair
Identity of Species 2 in Pair
Numeric predictor variable
Numeric response variable
A data frame with 190 rows and 4 columns.
Identity of Species 1 in Pair
Identity of Species 2 in Pair
Numeric predictor variable
Numeric response variable
A data frame with 190 rows and 4 columns.
Identity of Species 1 in Pair
Identity of Species 2 in Pair
Numeric predictor variable
Numeric response variable
A data frame with 190 rows and 4 columns.
Identity of Species 1 in Pair
Identity of Species 2 in Pair
Numeric predictor variable
Numeric response variable
A data frame with 190 rows and 4 columns.
Identity of Species 1 in Pair
Identity of Species 2 in Pair
Numeric predictor variable
Numeric response variable
A 190x190 covariance matrix
A phylo object
Simulated data generated with the script provided in the 'inst' directory.
Calculates an unscaled lineage-pair covariance matrix for use in downstream analyses.
taxapair.vcv(sp.pairs, tree, spnames=FALSE, dec=2, model="sq.diff", regularize=FALSE, regparam=NULL)
taxapair.vcv(sp.pairs, tree, spnames=FALSE, dec=2, model="sq.diff", regularize=FALSE, regparam=NULL)
sp.pairs |
A table (matrix or data.frame) in which the first column contains the names of 'species 1' and the second column contains the names of 'species 2'. Names must be in the same format used in the phylogenetic tree. |
tree |
An ultrametric phylogenetic tree ('phylo' object) containing species that appear in the dataset (as either a species 1 or species 2 or both). |
spnames |
Logical determining whether to use species names as row and column labels for matrix. If TRUE, then the name of a row or column will be in the form "species1_species2". If FALSE, names are formed from the indices of the species names in the tree "indexOfSpecies1_indexOfSpecies2". Defaults to FALSE. |
dec |
Number of decimal places to round the values in the matrix; defaults to 2. |
model |
One of 'sq.diff', 'sq.sum', 'simple.sum', and 'product'. Defaults to 'sq.diff'. See details. |
regularize |
Logical indicating whether regularization should be used if resulting matrix is numerically singular; defaults to FALSE. If TRUE, regularization is conducted by adding a small value to the diagonal of the matrix. By default, this value is equal to 1% of the median value on the diagonal, which is continually added until matrix is no longer numerically singular. |
regparam |
Custom regularization parameter. Instead of adding the default 1% of the median diagonal value to the diagonal, it will add regparam * that median value. IMPORTANT: it only does this once and does not continue adding the number until the matrix is non-singular. For 2%, write 0.02; for 10%, write 0.10, and so on. |
Just as the traits of different species are not independent due to varying
amounts of shared evolutionary history, so too are pairwise-defined traits like
'strength of RI' and 'range overlap' not independent among related taxonomic pairs.
The function taxapair.vcv()
calculates the expected covariance structure for
lineage-pair datasets given the taxa in each pair and a phylogenetic tree
(containing the taxa that appear in the dataset). The exact structure of this
covariance depends on the underlying model by which phylogenetic signal among
taxa is expected to translate into non-independence among taxonomic pairs
(Anderson et al. in review).
Let X be some underlying continuous biological character (or set of characters) that is defined for each taxon and that has phylogenetic signal. Users can choose one of four models by which signal in X generates covariance among pairs in a lineage-pair trait:
sq.diff (default) – the lineage-pair trait is influenced by the absolute difference between the two taxa in X.
sq.sum – the lineage-pair trait is linearly by the squared sum of the value of X in each taxon.
simple.sum – the lineage-pair trait is influenced by the sum of the the value of X in each taxon. IMPORTANT: this model tends to result in a singular matrix.
product – the lineage-pair trait is influenced by the product of the the value of X in each taxon.
In each case, it is assumed that there is a linear relationship between the value calculated in each model and the response variable.
A lineage-pair covariance matrix.
Anderson, S. A. S., et al. In review. The comparative analysis of lineage-pair data.
library(ape) # Load simulated dataset and tree data(data1) data(sim.tree1) # Calculate the lineage-pair covariance matrix linpair.mat = taxapair.vcv(sp.pairs=data1[,1:2], tree=sim.tree1) dim(linpair.mat) # Check the validity of the matrix covmat.check(linpair.mat)
library(ape) # Load simulated dataset and tree data(data1) data(sim.tree1) # Calculate the lineage-pair covariance matrix linpair.mat = taxapair.vcv(sp.pairs=data1[,1:2], tree=sim.tree1) dim(linpair.mat) # Check the validity of the matrix covmat.check(linpair.mat)
Calculates the four matrices required for fitting the two-term plmm model of Castillo (2007). See details.
twoterm.plmm.mats(sp.pairs, tree)
twoterm.plmm.mats(sp.pairs, tree)
sp.pairs |
A table (matrix or data.frame) in which the first column contains the names of 'species 1' and the second column contains the names of 'species 2'. Names must be in the same format used in the phylogenetic tree. |
tree |
An ultrametric phylogenetic tree ('phylo' object) describing the relationships among the species that appear in the dataset (as either a species 1 or species 2 or both). |
Castillo (2007) introduced a framework for analyzing lineage-pair data via an
extension of a phylogenetic linear mixed model (plmm) in which there are two
random-effect terms, one for the 'species 1' and another for the 'species 2' in
every pair. The model is Y = Xb + Z1u1 + Z1u2 + epsilon
, where b is the vector of
coefficients, X is a design matrix, u1 and u2 are vectors of species-specific random
effects that covary according to a phylogenetic covariance matrix C (u1 ~ N(0,C)
and u2 ~ N(0,C)
), Z1 and Z2 are design matrices that map the species-specific
effects to the correct species in each pair, and epsilon is residual error (epsilon~N(0,C)
).
The C matrix in the model is scaled by parameters that do not come into play here.
This function takes a table containng columsn for species 1 and species 2 for every pair and a phylogenetic tree and returns the matrices Z1 and Z2 as well as pruned phylogenetic covariance matrices for u1 and u2. This pruning is sometimes required because not all species found in the dataset will appear as both 'species 1' and 'species 2'. Z1 and Z2 will therefore have different sizes and u1 and u2 require different covariance matrices. Note that the matrices themselves are pruned, not the tree from which they are derived, as the latter could result in the covariance between two species being different for the 'species 1' and 'species 2' random effects.
A list of four matrices: Z1, Z2, cov1, and cov2, where the latter two describe the covariance among the random effects in u1 and u2.
Castillo, D. M. (2007). Factors contributing to the accumulation of reproductive isolation: A mixed model approach. Ecology and Evolution 7:5808-5820. doi.org/10.1002/ece3.3093
# Simulate a tree lin.tree = phytools::pbtree(n=20) # Generate lineage pairs as the pairwise combinations of species in the tree lin.pairs = data.frame(t(combn(lin.tree$tip.label,2))); colnames(lin.pairs)=c("sp1", "sp2") # Calculate the matrices mats = twoterm.plmm.mats(sp.pairs=lin.pairs, tree=lin.tree) # Check structure of design matrices sapply(mats, dim) head(mats$Z1[,1:5]) head(mats$Z2[,1:5]) head(mats$cov2[,1:5]) head(mats$cov2[,1:5]) # Ensure covariance matrices are valid covariance matrices sapply(mats[3:4], covmat.check)
# Simulate a tree lin.tree = phytools::pbtree(n=20) # Generate lineage pairs as the pairwise combinations of species in the tree lin.pairs = data.frame(t(combn(lin.tree$tip.label,2))); colnames(lin.pairs)=c("sp1", "sp2") # Calculate the matrices mats = twoterm.plmm.mats(sp.pairs=lin.pairs, tree=lin.tree) # Check structure of design matrices sapply(mats, dim) head(mats$Z1[,1:5]) head(mats$Z2[,1:5]) head(mats$cov2[,1:5]) head(mats$cov2[,1:5]) # Ensure covariance matrices are valid covariance matrices sapply(mats[3:4], covmat.check)
Fits the two-term plmm model of Castillo (2007) in a Bayesian framework. Bayesian sampling is conducted in the Stan software via the 'rstan' package. Users supply vectors of observations and an ultrametric phylogenetic tree. Users can alter parameters for model-parameter prior distributions and Bayesian sampling settings. See details.
twoterm.plmm.stan(des, y, sp1s, sp2s, tree, iter=2000, chains=4, coef.u=0, coef.sd=10, physig2.u=-1, physig2.sd=1, randsig2.loc=0, randsig2.sc=2.5, cores=4, ...)
twoterm.plmm.stan(des, y, sp1s, sp2s, tree, iter=2000, chains=4, coef.u=0, coef.sd=10, physig2.u=-1, physig2.sd=1, randsig2.loc=0, randsig2.sc=2.5, cores=4, ...)
des |
A vector of predictor variable observations OR, in the case of multiple predictors, a matrix in which each column is a vector of observations of a given predictor. Function will add a column of 1s to make this a design matrix whose first column corresponds to the model intercept (unless such a column already exists). |
y |
A vector of response variable observations. |
sp1s |
Character vector naming the "species 1" of every pair. IMPORTANT: name formatting must match that used in the tip.label of the tree |
sp2s |
Character vector naming the "species 2" of every pair. IMPORTANT: name formatting must match that used in the tip.label of the tree |
tree |
Phylogenetic tree (a 'phylo' object) for the species included in the analysis (regardless of whether they are species 1 or 2). |
iter |
Number of iterations to run on each chain; defaults to 2000 (more are often necessary). |
chains |
Number of Markov chains to run; defaults to 4. |
coef.u |
Mean of the Gaussian prior for each preditor variable coefficient; defaults to 0. |
coef.sd |
SD of the Gaussian prior for each preditor variable coefficient; defaults to 10. |
physig2.u |
Mean of the prior distribution (lognormal) for the scale of the phylogenetic component of residual covariance; defaults to -1. |
physig2.sd |
SD of the prior distribution (lognormal) for the scale of the phylogenetic component of residual covariance; defaults to 1. |
randsig2.loc |
Location parameter of the prior distribution (cauchy) for the scale of the independent component of residual covariance ; defaults to 0. |
randsig2.sc |
Scale parameter of the prior distribution (cauchy) for the scale of the independent component of residual covariance ; defaults to 2.5. |
cores |
Number of cores to use. |
... |
additional arguments passed to |
The model introduced by Castillo (2007) for analyzing lineage-pair data is a
version of a phylogenetic linear mixed model (plmm) in which there are two
random-effect terms, one for the 'species 1' and another for the 'species 2' in
every pair. The model is Y = Xb + Z1u1 + Z1u2 + epsilon
, where u1 and u2 are
vectors of species-specific random effects that covary according to a phylogenetic
covariance matrix C. In this model, u1 ~ N(0,physig2*C)
, u2 ~ N(0,physig2*C)
, and
epsilon~N(0, randsig2*I)
, where randsig2 is the scaling parameter for identity
matrix I and physig2 is the scaling parameter for phylogenetic covariance
matrix C. See twoterm_plmm_mats
for details on calculating
the Z1 and Z2 matrices.
Prior Distributions for Model Parameters: The underlying stan models assume the following prior distributions for model parameters
Regression coefficients: Gaussian prior (users can set prior mean and sd)
physig2
: lognormal prior (users can set prior mean and sd)
randsig2
: cauchy prior (users can set location and scale parameters of prior)
A list containing two elements: (1) the posterior distribution of model parameters, and (2) the log-likelihood of the posteriors for use in downstream analyses (e.g. the calculation of model fitting metrics like loo or waic)
Castillo, D. M. (2007). Factors contributing to the accumulation of reproductive isolation: A mixed model approach. Ecology and Evolution 7:5808-5820. doi.org/10.1002/ece3.3093
# Load a dataset simulated with a non-independent response observations data(data3) # Also load the simulated tree that was used to generate those pairs data(sim.tree1) # Fit an OLS model result1 = linreg.stan(des=data3[,3], y=data3[,4], cores=2) # Fit the twoterm.plmm.stan model result2 = twoterm.plmm.stan(des=data3[,3], y=data3[,4], sp1s=data3[,1], sp2s=data3[,2], tree=sim.tree1, cores=2) # Compare posterior parameter estimates result1[[1]] result2[[1]] # Compare the fit of the two models via loo and waic loo1 = loo::loo(result1[[2]]) loo2 = loo::loo(result2[[2]]) waic1 = loo::waic(result1[[2]]) waic2 = loo::waic(result2[[2]]) loo1 loo2 waic1 waic2 loo::loo_compare(loo1, loo2) loo::loo_compare(waic1, waic2)
# Load a dataset simulated with a non-independent response observations data(data3) # Also load the simulated tree that was used to generate those pairs data(sim.tree1) # Fit an OLS model result1 = linreg.stan(des=data3[,3], y=data3[,4], cores=2) # Fit the twoterm.plmm.stan model result2 = twoterm.plmm.stan(des=data3[,3], y=data3[,4], sp1s=data3[,1], sp2s=data3[,2], tree=sim.tree1, cores=2) # Compare posterior parameter estimates result1[[1]] result2[[1]] # Compare the fit of the two models via loo and waic loo1 = loo::loo(result1[[2]]) loo2 = loo::loo(result2[[2]]) waic1 = loo::waic(result1[[2]]) waic2 = loo::waic(result2[[2]]) loo1 loo2 waic1 waic2 loo::loo_compare(loo1, loo2) loo::loo_compare(waic1, waic2)