Maximum likelihood by hand

With the function pml_bb from phangorn (Schliep 2011) a lot of steps have become easier and shorter. If you want to have more control over all of the used parameters, it is also possible to use the older functions, e.g. optim_pml. The data is the same as in the vignette Estimating phylogenetic trees with phangorn:

library(ape)
library(phangorn)
fdir <- system.file("extdata/trees", package = "phangorn")
primates <- read.phyDat(file.path(fdir, "primates.dna"),
                        format = "interleaved")

As a starting tree, we calculate a neighbor joining tree:

dm <- dist.ml(primates)
treeNJ  <- NJ(dm)

fit <- pml(treeNJ, data=primates)
fit

## model: JC 
## loglikelihood: -3075 
## unconstrained loglikelihood: -1230 
## 
## Rate matrix:
##   a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
## 
## Base frequencies:  
##    a    c    g    t 
## 0.25 0.25 0.25 0.25

The function pml returns an object of class pml. This object contains the data, the tree and many different parameters of the model like the likelihood. There are many generic functions for the class pml available, which allow the handling of these objects.

methods(class="pml")

##  [1] AICc   BIC    anova  glance logLik plot   print  simSeq update vcov  
## see '?methods' for accessing help and source code

The object fit just estimated the likelihood for the tree it got supplied, but the branch length are not optimized for the Jukes-Cantor (Jukes and Cantor 1969) model yet, which can be done with the function optim.pml.

fitJC  <- optim.pml(fit, rearrangement="NNI")

## optimize edge weights:  -3075 --> -3068 
## optimize edge weights:  -3068 --> -3068 
## optimize topology:  -3068 --> -3068  NNI moves:  1 
## optimize edge weights:  -3068 --> -3068 
## optimize topology:  -3068 --> -3068  NNI moves:  0

logLik(fitJC)

## 'log Lik.' -3068 (df=25)

With the default values pml will estimate a Jukes-Cantor model. That means equal base frequencies and all transition rates are equal. The generic function update allows to change parameters manually. This is not what we usually want to do. However we might want to supply a different tree or change the number of rate categories.

fitF81 <- update(fitJC, k=4, inv=0.2, bf=baseFreq(primates))
fitF81

## model: F81+G(4)+I 
## loglikelihood: -3037 
## unconstrained loglikelihood: -1230 
## Proportion of invariant sites: 0.2 
## Model of rate heterogeneity: Discrete gamma model
## Number of rate categories: 4 
## Shape parameter: 1 
##     Rate Proportion
## 1 0.0000        0.2
## 2 0.1712        0.2
## 3 0.5959        0.2
## 4 1.2500        0.2
## 5 2.9829        0.2
## 
## Rate matrix:
##   a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
## 
## Base frequencies:  
##       a       c       g       t 
## 0.37481 0.40160 0.03911 0.18448

In the line above we changed the model to a (discrete) rate across site model with 4 rate categories (using the default shape parameter of 1), to 0.2 invariant sites and supply empirical base frequencies.

fitGTR <- optim.pml(fitF81, model="GTR", optInv=TRUE, optGamma=TRUE,
    rearrangement = "NNI", control = pml.control(trace = 0))
fitGTR

## model: GTR+G(4)+I 
## loglikelihood: -2611 
## unconstrained loglikelihood: -1230 
## Proportion of invariant sites: 0.006978 
## Model of rate heterogeneity: Discrete gamma model
## Number of rate categories: 4 
## Shape parameter: 3.081 
##     Rate Proportion
## 1 0.0000   0.006978
## 2 0.3982   0.248256
## 3 0.7411   0.248256
## 4 1.0905   0.248256
## 5 1.7982   0.248256
## 
## Rate matrix:
##         a         c         g      t
## a  0.0000  0.947235 63.592883  0.807
## c  0.9472  0.000000  0.003986 24.635
## g 63.5929  0.003986  0.000000  1.000
## t  0.8070 24.634579  1.000000  0.000
## 
## Base frequencies:  
##       a       c       g       t 
## 0.37481 0.40160 0.03911 0.18448

We will change the model to the GTR + Γ(4) + I model and then optimize all the parameters.

With the control parameters the thresholds for the fitting process can be changed. Here we want just to suppress output during the fitting process. For larger trees the NNI rearrangements often get stuck in a local maximum. We added two stochastic algorithms to improve topology search. The first (set rearrangement="stochastic") performs stochastic rearrangements similar as in (Nguyen et al. 2015), which makes random NNI permutation to the tree, which than gets optimized to escape local optima. The second option (rearrangement="ratchet") perform the likelihood ratchet (Vos 2003).

While these algorithms may find better trees they will also take more time.

fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
    rearrangement = "stochastic", control = pml.control(trace = 0))
fitGTR

## model: GTR+G(4)+I 
## loglikelihood: -2608 
## unconstrained loglikelihood: -1230 
## Proportion of invariant sites: 0.007413 
## Model of rate heterogeneity: Discrete gamma model
## Number of rate categories: 4 
## Shape parameter: 2.995 
##     Rate Proportion
## 1 0.0000   0.007413
## 2 0.3917   0.248147
## 3 0.7367   0.248147
## 4 1.0906   0.248147
## 5 1.8109   0.248147
## 
## Rate matrix:
##         a         c         g       t
## a  0.0000  0.719971 73.923279  0.5986
## c  0.7200  0.000000  0.003285 25.8696
## g 73.9233  0.003285  0.000000  1.0000
## t  0.5986 25.869567  1.000000  0.0000
## 
## Base frequencies:  
##       a       c       g       t 
## 0.37481 0.40160 0.03911 0.18448

Model comparison

We can compare nested models for the JC and GTR + Γ(4) + I model using likelihood ratio statistic

anova(fitJC, fitGTR)

## Likelihood Ratio Test Table
##   Log lik. Df Df change Diff log lik. Pr(>|Chi|)    
## 1    -3068 25                                       
## 2    -2608 35        10           921     <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

with the Shimodaira-Hasegawa test

SH.test(fitGTR, fitJC)

##      Trees  ln L Diff ln L p-value
## [1,]     1 -2608       0.0  0.5013
## [2,]     2 -3068     460.5  0.0000

or with the AIC

AIC(fitJC)

## [1] 6187

AIC(fitGTR)

## [1] 5286

AICc(fitGTR)

## [1] 5298

BIC(fitGTR)

## [1] 5406

Bootstrap

At last we may want to apply standard bootstrap to test how well the edges of the tree are supported. This has already been shown in the vignette Estimating phylogenetic trees with phangorn.

bs <- bootstrap.pml(fitJC, bs=100, optNni=TRUE,
    control = pml.control(trace = 0))

Now we can plot the tree with the bootstrap support values on the edges and also look at consensusNet to identify potential conflict.

plotBS(midpoint(fitJC$tree), bs, p = 50, type="p")

Tree with bootstrap support. Unrooted tree (midpoint rooted) with bootstrap support values.

cnet <- consensusNet(bs, p=0.2)
plot(cnet, show.edge.label=TRUE)

ConsensusNet from the bootstrap sample.

Generating trees

phangorn has several functions to generate tree topologies, which may are interesting for simulation studies. allTrees computes all possible bifurcating tree topologies either rooted or unrooted for up to 10 taxa. One has to keep in mind that the number of trees is growing exponentially, use howmanytrees from ape as a reminder.

trees <- allTrees(5)
par(mfrow=c(3,5), mar=rep(0,4))
for(i in 1:15)plot(trees[[i]], cex=1, type="u")

nni returns a list of all trees which are one nearest neighbor interchange away.

nni(trees[[1]])

## 4 phylogenetic trees

rNNI and rSPR generate trees which are a defined number of NNI (nearest neighbor interchange) or SPR (subtree pruning and regrafting) away.

Session info

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phangorn_2.12.1 ape_5.8-1       rmarkdown_2.29 
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-2      gtable_0.3.6      jsonlite_1.8.9    dplyr_1.1.4      
##  [5] compiler_4.4.2    tidyselect_1.2.1  Rcpp_1.0.14       parallel_4.4.2   
##  [9] jquerylib_0.1.4   scales_1.3.0      yaml_2.3.10       fastmap_1.2.0    
## [13] lattice_0.22-6    ggplot2_3.5.1     R6_2.6.1          labeling_0.4.3   
## [17] generics_0.1.3    igraph_2.1.4      knitr_1.49        tibble_3.2.1     
## [21] maketools_1.3.2   munsell_0.5.1     pillar_1.10.1     bslib_0.9.0      
## [25] rlang_1.1.5       fastmatch_1.1-6   cachem_1.1.0      xfun_0.50        
## [29] quadprog_1.5-8    sass_0.4.9        sys_3.4.3         cli_3.6.4        
## [33] withr_3.0.2       magrittr_2.0.3    digest_0.6.37     grid_4.4.2       
## [37] lifecycle_1.0.4   nlme_3.1-167      vctrs_0.6.5       evaluate_1.0.3   
## [41] glue_1.8.0        farver_2.1.2      codetools_0.2-20  ggseqlogo_0.2    
## [45] buildtools_1.0.0  colorspace_2.1-1  tools_4.4.2       pkgconfig_2.0.3  
## [49] htmltools_0.5.8.1

References

Jukes, Thomas H., and Charles R. Cantor. 1969. “{CHAPTER} 24 - Evolution of Protein Molecules.” In Mammalian Protein Metabolism, edited by H. N. Munro, 21–132. Academic Press.

Nguyen, Lam-Tung, Heiko A. Schmidt, Arndt von Haeseler, and Bui Quang Minh. 2015. “IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies.” Molecular Biology and Evolution 32 (1): 268–74. https://doi.org/10.1093/molbev/msu300.

Schliep, Klaus Peter. 2011. “Phangorn: Phylogenetic Analysis in R.” Bioinformatics 27 (4): 592–93. https://doi.org/10.1093/bioinformatics/btq706.

Vos, R. A. 2003. “Accelerated Likelihood Surface Exploration: The Likelihood Ratchet.” Systematic Biology 52 (3): 368–73.