Title: | Compositional Data Analysis |
---|---|
Description: | Regression, classification, contour plots, hypothesis testing and fitting of distributions for compositional data are some of the functions included. We further include functions for percentages (or proportions). The standard textbook for such data is John Aitchison's (1986) "The statistical analysis of compositional data". Relevant papers include: a) Tsagris M.T., Preston S. and Wood A.T.A. (2011). "A data-based power transformation for compositional data". Fourth International Workshop on Compositional Data Analysis. <doi:10.48550/arXiv.1106.1451> b) Tsagris M. (2014). "The k-NN algorithm for compositional data: a revised approach with and without zero values present". Journal of Data Science, 12(3): 519--534. <doi:10.6339/JDS.201407_12(3).0008>. c) Tsagris M. (2015). "A novel, divergence based, regression for compositional data". Proceedings of the 28th Panhellenic Statistics Conference, 15-18 April 2015, Athens, Greece, 430--444. <doi:10.48550/arXiv.1511.07600>. d) Tsagris M. (2015). "Regression analysis with compositional data containing zero values". Chilean Journal of Statistics, 6(2): 47--57. <https://soche.cl/chjs/volumes/06/02/Tsagris(2015).pdf>. e) Tsagris M., Preston S. and Wood A.T.A. (2016). "Improved supervised classification for compositional data using the alpha-transformation". Journal of Classification, 33(2): 243--261. <doi:10.1007/s00357-016-9207-5>. f) Tsagris M., Preston S. and Wood A.T.A. (2017). "Nonparametric hypothesis testing for equality of means on the simplex". Journal of Statistical Computation and Simulation, 87(2): 406--422. <doi:10.1080/00949655.2016.1216554>. g) Tsagris M. and Stewart C. (2018). "A Dirichlet regression model for compositional data with zeros". Lobachevskii Journal of Mathematics, 39(3): 398--412. <doi:10.1134/S1995080218030198>. h) Alenazi A. (2019). "Regression for compositional data with compositional data as predictor variables with or without zero values". 
Journal of Data Science, 17(1): 219--238. <doi:10.6339/JDS.201901_17(1).0010>. i) Tsagris M. and Stewart C. (2020). "A folded model for compositional data analysis". Australian and New Zealand Journal of Statistics, 62(2): 249--277. <doi:10.1111/anzs.12289>. j) Alenazi A.A. (2022). "f-divergence regression models for compositional data". Pakistan Journal of Statistics and Operation Research, 18(4): 867--882. <doi:10.18187/pjsor.v18i4.3969>. k) Tsagris M. and Stewart C. (2022). "A Review of Flexible Transformations for Modeling Compositional Data". In Advances and Innovations in Statistics and Data Science, pp. 225--234. <doi:10.1007/978-3-031-08329-7_10>. l) Alenazi A. (2023). "A review of compositional data analysis and recent advances". Communications in Statistics--Theory and Methods, 52(16): 5535--5567. <doi:10.1080/03610926.2021.2014890>. m) Tsagris M., Alenazi A. and Stewart C. (2023). "Flexible non-parametric regression models for compositional response data with zeros". Statistics and Computing, 33(106). <doi:10.1007/s11222-023-10277-5>. n) Tsagris. M. (2024). "Constrained least squares simplicial-simplicial regression". <doi:10.48550/arXiv.2403.19835>. |
Authors: | Michail Tsagris [aut, cre], Giorgos Athineou [aut], Abdulaziz Alenazi [ctb], Christos Adam [ctb] |
Maintainer: | Michail Tsagris <[email protected]> |
License: | GPL (>= 2) |
Version: | 7.2 |
Built: | 2024-12-04 22:01:00 UTC |
Source: | CRAN |
A Collection of Functions for Compositional Data Analysis.
Package: | Compositional |
Type: | Package |
Version: | 7.2 |
Date: | 2024-12-04 |
License: | GPL-2 |
Michail Tsagris <[email protected]>
Acknowledgments:
Michail Tsagris would like to express his gratitude to Professor Andy Wood and Professor Simon Preston from the University of Nottingham for supervising his PhD in compositional data analysis.
We would also like to thank Professor Kurt Hornik (and the rest of the R core team) for his help with this package.
Manos Papadakis, undergraduate student in the Department of Computer Science, University of Crete, is also acknowledged for his programming tips.
Ermanno Affuso from the University of South Alabama suggested that I have a default value in the function mkde.
Van Thang Hoang from Hasselt University spotted a bug in the function js.compreg.
Claudia Wehrhahn Cortes spotted a bug in the function diri.reg.
Philipp Kynast from Bruker Daltonik GmbH found a mistake in the function mkde, which is now fixed.
Jasmine Heyse from the University of Ghent spotted a bug in the function kl.compreg, which is now fixed.
Magne Neby suggested adding names to the covariance matrix of the divergence based regression models.
John Barry from the Centre for Environment, Fisheries, and Aquaculture Science (UK) suggested that I should add more explanation to the function diri.est. I hope it is clearer now.
Charlotte Fabri and Laura Byrne spotted a possible problem in the function zadr.
Levi Bankston found a bug in the bootstrap version of the function kl.compreg.
Sucharitha Dodamgodage suggested adding an extra case to the function dirimean.test.
Loic Mangnier found a bug in the function lc.glm, which is now fixed and also runs faster.
Ravi Varadhan found a bug in diri.reg and is acknowledged for that.
Michail Tsagris [email protected], Giorgos Athineou <[email protected]>, Abdulaziz Alenazi <[email protected]> and Christos Adam [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Aitchison's test for two mean vectors and/or covariance matrices.
ait.test(x1, x2, type = 1, alpha = 0.05)
x1 |
A matrix containing the compositional data of the first sample. Zeros are not allowed. |
x2 |
A matrix containing the compositional data of the second sample. Zeros are not allowed. |
type |
The type of hypothesis test to perform. Type = 1 refers to testing the equality of the mean vectors and the covariance matrices. Type = 2 refers to testing the equality of the covariance matrices. Type = 3 refers to testing the equality of the mean vectors. |
alpha |
The significance level, set to 0.05 by default. |
The test is described in Aitchison (2003). See the references for more information.
A vector with the test statistic, the p-value, the critical value and the degrees of freedom of the chi-square distribution.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
John Aitchison (2003). The Statistical Analysis of Compositional Data, p. 153-157. Blackburn Press.
x1 <- as.matrix(iris[1:50, 1:4])
x1 <- x1 / rowSums(x1)
x2 <- as.matrix(iris[51:100, 1:4])
x2 <- x2 / rowSums(x2)
ait.test(x1, x2, type = 1)
ait.test(x1, x2, type = 2)
ait.test(x1, x2, type = 3)
All pairwise additive log-ratio transformations.
alr.all(x)
x |
A numerical matrix with the compositional data. |
The additive log-ratio transformation, with the first component as the common divisor, is applied. Then all other pairwise log-ratios are computed and appended as extra columns: divide by the first component, then by the second component, and so on. This means that no zeros are allowed.
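The computation described above can be sketched in base R (a minimal illustration only; alr_all_sketch is a hypothetical name, not the package's implementation):

```r
# Minimal sketch of all pairwise additive log-ratio transformations
# (illustrative; the package's column ordering may differ).
alr_all_sketch <- function(x) {
  D <- ncol(x)
  out <- NULL
  for (j in 1:D) {
    # log-ratios with component j as the common divisor,
    # dropping the trivial log(x[, j] / x[, j]) column
    out <- cbind(out, log(x[, -j, drop = FALSE] / x[, j]))
  }
  out
}
x <- matrix(runif(30), ncol = 3)
x <- x / rowSums(x)       # closure: rows sum to 1, no zeros
y <- alr_all_sketch(x)    # 10 rows, D * (D - 1) = 6 columns
```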
A matrix with all pairwise alr transformed data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
x <- as.matrix(iris[, 2:4])
x <- x / rowSums(x)
y <- alr.all(x)
alpha-generalised correlations between two compositional datasets.
acor(y, x, a, type = "dcor")
y |
A matrix with the compositional data. |
x |
A matrix with the compositional data. |
a |
The value of the power transformation; it has to be between -1 and 1. If zero values are present, it has to be greater than 0. |
type |
The type of correlation to compute: the distance correlation ("dcor"), the canonical correlation ("cancor") or "both". |
The alpha-transformation is applied to each composition and then the distance correlation or the canonical correlation is computed. If one value of alpha is supplied, type = "cancor" will return all eigenvalues. If more than one value of alpha is provided, only the first eigenvalue will be returned.
A vector or a matrix, depending on the number of values of alpha and the type of correlation computed.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
G.J. Szekely, M.L. Rizzo and N. K. Bakirov (2007). Measuring and Testing Independence by Correlation of Distances. Annals of Statistics, 35(6): 2769-2794.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
acor.tune, aeqdist.etest, alfa, alfa.profile
y <- rdiri(30, runif(3))
x <- rdiri(30, runif(4))
acor(y, x, a = 0.4)
ANOVA for the log-contrast GLM versus the unconstrained GLM.
lcglm.aov(mod0, mod1)
mod0 |
The log-contrast GLM; the object returned by lc.glm. |
mod1 |
The unconstrained GLM; the object returned by ulc.glm. |
A chi-square test is performed to test the zero-to-sum constraints of the regression coefficients.
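The comparison has the shape of a standard likelihood-ratio chi-square test; a generic sketch with hypothetical numbers (these are illustrative values, not output of lc.glm or ulc.glm, and not the package's internals):

```r
# Likelihood-ratio chi-square sketch for nested models.
ll0 <- -120.5   # hypothetical log-likelihood, constrained (log-contrast) model
ll1 <- -119.8   # hypothetical log-likelihood, unconstrained model
df  <- 1        # one zero-to-sum constraint on the coefficients
stat <- 2 * (ll1 - ll0)
pval <- pchisq(stat, df, lower.tail = FALSE)
```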
A vector with two values, the chi-square test statistic and its associated p-value.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
y <- rbinom(150, 1, 0.5)
x <- as.matrix(iris[, 2:4])
x <- x / rowSums(x)
mod0 <- lc.glm(y, x)
mod1 <- ulc.glm(y, x)
lcglm.aov(mod0, mod1)
ANOVA for the log-contrast regression versus the unconstrained linear regression.
lcreg.aov(mod0, mod1)
mod0 |
The log-contrast regression model; the object returned by lc.reg. |
mod1 |
The unconstrained linear regression model; the object returned by ulc.reg. |
An F-test is performed to test the zero-to-sum constraints of the regression coefficients.
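The F-test follows the usual nested-linear-model formula; a sketch with hypothetical residual sums of squares and dimensions (illustrative values, not output of lc.reg or ulc.reg):

```r
# Nested-linear-model F-test sketch.
rss0 <- 52.3               # hypothetical RSS, constrained (log-contrast) model
rss1 <- 50.1               # hypothetical RSS, unconstrained model
n <- 150; p <- 4; q <- 1   # sample size, parameters, number of constraints
Fstat <- ((rss0 - rss1) / q) / (rss1 / (n - p))
pval  <- pf(Fstat, q, n - p, lower.tail = FALSE)
```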
A vector with two values, the F test statistic and its associated p-value.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
lc.reg, ulc.reg, alfa.pcr, alfa.knn.reg
y <- iris[, 1]
x <- as.matrix(iris[, 2:4])
x <- x / rowSums(x)
mod0 <- lc.reg(y, x)
mod1 <- ulc.reg(y, x)
lcreg.aov(mod0, mod1)
Beta regression.
beta.reg(y, x, xnew = NULL)
y |
The response variable. It must be a numerical vector with proportions excluding 0 and 1. |
x |
The independent variable(s). It can be a vector, a matrix or a data frame with only continuous variables, or a data frame with mixed or only categorical variables. |
xnew |
If you have new values for the predictor variables (dataset) whose response values you want to predict insert them here. |
Beta regression is fitted.
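The fitted model maximises a log-likelihood of roughly the following form, under the mean-precision parameterisation of Ferrari and Cribari-Neto (2004), where the mean follows a logit link and the beta shape parameters are (mu * phi, (1 - mu) * phi). A minimal sketch; the coefficients b and precision phi below are hypothetical values, not fitted estimates, and betareg_loglik is a hypothetical helper:

```r
# Sketch of the beta regression log-likelihood (mean-precision form).
set.seed(1)
y <- rbeta(100, 3, 5)
x <- cbind(1, rnorm(100))   # design matrix with an intercept
betareg_loglik <- function(b, phi, y, x) {
  mu <- plogis(x %*% b)     # logit link keeps the mean in (0, 1)
  sum(dbeta(y, mu * phi, (1 - mu) * phi, log = TRUE))
}
ll <- betareg_loglik(b = c(-0.5, 0), phi = 8, y = y, x = x)
```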
A list including:
phi |
The estimated precision parameter. |
info |
A matrix with the estimated regression parameters, their standard errors, Wald statistics and associated p-values. |
loglik |
The log-likelihood of the regression model. |
est |
The estimated values if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ferrari S.L.P. and Cribari-Neto F. (2004). Beta Regression for Modelling Rates and Proportions. Journal of Applied Statistics, 31(7): 799-815.
y <- rbeta(300, 3, 5)
x <- matrix(rnorm(300 * 2), ncol = 2)
beta.reg(y, x)
Column-wise MLE of some univariate distributions.
colbeta.est(x, tol = 1e-07, maxiters = 100, parallel = FALSE)
collogitnorm.est(x)
colunitweibull.est(x, tol = 1e-07, maxiters = 100, parallel = FALSE)
colzilogitnorm.est(x)
x |
A numerical matrix with data. Each column refers to a different vector of observations of the same distribution. The values must be proportions, excluding 0 and 1. |
tol |
The tolerance value to terminate the Newton-Fisher algorithm. |
maxiters |
The maximum number of iterations to implement. |
parallel |
Do you want the calculations to take place in parallel? The default value is FALSE. |
For each column, the same distribution is fitted and its parameters and log-likelihood are computed.
A matrix with two, three or four columns. The first one, two or three columns contain the parameter(s) of the distribution, while the last column contains the relevant log-likelihood.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
N.L. Johnson, S. Kotz & N. Balakrishnan (1994). Continuous Univariate Distributions, Volume 1 (2nd Edition).
N.L. Johnson, S. Kotz & N. Balakrishnan (1970). Distributions in statistics: continuous univariate distributions, Volume 2.
J. Mazucheli, A. F. B. Menezes, L. B. Fernandes, R. P. de Oliveira & M. E. Ghitany (2020). The unit-Weibull distribution as an alternative to the Kumaraswamy distribution for the modeling of quantiles conditional on covariates. Journal of Applied Statistics, DOI:10.1080/02664763.2019.1657813.
x <- matrix(rbeta(200, 3, 4), ncol = 4)
a <- colbeta.est(x)
Contour plot of mixtures of Dirichlet distributions in S^2.
mixdiri.contour(a, prob, n = 100, x = NULL, cont.line = FALSE)
a |
A matrix where each row contains the parameters of each Dirichlet distribution. |
prob |
A vector with the mixing probabilities. |
n |
The number of grid points to consider over which the density is calculated. |
x |
This is either NULL (no data) or contains a 3 column matrix with compositional data. |
cont.line |
Do you want the contour lines to appear? If yes, set this TRUE. |
The user can plot only the contour lines of a Dirichlet with a given vector of parameters, or can also add the relevant data should he/she wish to.
A ternary diagram with the points and the Dirichlet contour lines.
Michail Tsagris and Christos Adam.
R implementation and documentation: Michail Tsagris [email protected] and Christos Adam [email protected].
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
diri.contour, gendiri.contour, compnorm.contour,
comp.kerncontour, mix.compnorm.contour,
diri.nr, dda
a <- matrix(c(12, 30, 45, 32, 50, 16), byrow = TRUE, ncol = 3)
prob <- c(0.5, 0.5)
mixdiri.contour(a, prob)
Contour plot of the multivariate normal in S^2.
alfa.contour(m, s, a, n = 100, x = NULL, cont.line = FALSE)
m |
The mean vector of the alpha-transformed data. |
s |
The covariance matrix of the alpha-transformed data. |
a |
The value of alpha for the alpha-transformation. |
n |
The number of grid points to consider over which the density is calculated. |
x |
This is either NULL (no data) or contains a 3 column matrix with compositional data. |
cont.line |
Do you want the contour lines to appear? If yes, set this TRUE. |
The alpha-transformation is applied to the compositional data and then, for a grid of points within the 2-dimensional simplex, the density of the multivariate normal is calculated and the contours are plotted.
The contour plot of the multivariate normal appears.
Michail Tsagris and Christos Adam.
R implementation and documentation: Michail Tsagris [email protected] and Christos Adam [email protected].
Tsagris M. and Stewart C. (2022). A Review of Flexible Transformations for Modeling Compositional Data. In Advances and Innovations in Statistics and Data Science, pp. 225–234. https://link.springer.com/chapter/10.1007/978-3-031-08329-7_10
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
folded.contour, compnorm.contour, diri.contour, mix.compnorm.contour, bivt.contour, skewnorm.contour
x <- as.matrix(iris[, 1:3])
x <- x / rowSums(x)
a <- a.est(x)$best
m <- colMeans(alfa(x, a)$aff)
s <- cov(alfa(x, a)$aff)
alfa.contour(m, s, a)
Contour plot of the alpha-folded model in S^2.
folded.contour(mu, su, p, a, n = 100, x = NULL, cont.line = FALSE)
mu |
The mean vector of the folded model. |
su |
The covariance matrix of the folded model. |
p |
The probability inside the simplex of the folded model. |
a |
The value of alpha for the alpha-transformation. |
n |
The number of grid points to consider over which the density is calculated. |
x |
This is either NULL (no data) or contains a 3 column matrix with compositional data. |
cont.line |
Do you want the contour lines to appear? If yes, set this TRUE. |
The alpha-transformation is applied to the compositional data and then, for a grid of points within the 2-dimensional simplex, the folded model's density is calculated and the contours are plotted.
The contour plot of the folded model appears.
Michail Tsagris and Christos Adam.
R implementation and documentation: Michail Tsagris [email protected] and Christos Adam [email protected].
Tsagris M. and Stewart C. (2022). A Review of Flexible Transformations for Modeling Compositional Data. In Advances and Innovations in Statistics and Data Science, pp. 225–234. https://link.springer.com/chapter/10.1007/978-3-031-08329-7_10
Tsagris M. and Stewart C. (2020). A folded model for compositional data analysis. Australian and New Zealand Journal of Statistics, 62(2): 249-277. https://arxiv.org/pdf/1802.07330.pdf
alfa.contour, compnorm.contour, diri.contour, mix.compnorm.contour,
bivt.contour, skewnorm.contour
x <- as.matrix(iris[, 1:3])
x <- x / rowSums(x)
a <- a.est(x)$best
mod <- alpha.mle(x, a)
folded.contour(mod$mu, mod$su, mod$p, a)
Contour plot of the Dirichlet distribution in S^2.
diri.contour(a, n = 100, x = NULL, cont.line = FALSE)
a |
A vector with three elements corresponding to the 3 (estimated) parameters. |
n |
The number of grid points to consider over which the density is calculated. |
x |
This is either NULL (no data) or contains a 3 column matrix with compositional data. |
cont.line |
Do you want the contour lines to appear? If yes, set this TRUE. |
The user can plot only the contour lines of a Dirichlet with a given vector of parameters, or can also add the relevant data should he/she wish to.
A ternary diagram with the points and the Dirichlet contour lines.
Michail Tsagris and Christos Adam.
R implementation and documentation: Michail Tsagris [email protected] and Christos Adam [email protected].
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
mixdiri.contour, gendiri.contour, compnorm.contour,
comp.kerncontour, mix.compnorm.contour
x <- as.matrix(iris[, 1:3])
x <- x / rowSums(x)
diri.contour(a = c(3, 4, 2))
Contour plot of the Flexible Dirichlet distribution in S^2.
fd.contour(alpha, prob, tau, n = 100, x = NULL, cont.line = FALSE)
alpha |
A vector of the non-negative alpha parameters. |
prob |
A vector of the clusters' probabilities. It must sum to one. |
tau |
The non-negative scalar tau parameter. |
n |
The number of grid points to consider over which the density is calculated. |
x |
This is either NULL (no data) or contains a 3 column matrix with compositional data. |
cont.line |
Do you want the contour lines to appear? If yes, set this TRUE. |
The user can plot only the contour lines of a Dirichlet with a given vector of parameters, or can also add the relevant data should they wish to.
A ternary diagram with the points and the Flexible Dirichlet contour lines.
Michail Tsagris and Christos Adam.
R implementation and documentation: Michail Tsagris [email protected] and Christos Adam [email protected].
Ongaro A. and Migliorati S. (2013). A generalization of the Dirichlet distribution. Journal of Multivariate Analysis, 114, 412–426.
Migliorati S., Ongaro A. and Monti G. S. (2017). A structured Dirichlet mixture model for compositional data: inferential and applicative issues. Statistics and Computing, 27, 963–983.
compnorm.contour, folded.contour, bivt.contour,
comp.kerncontour, mix.compnorm.contour
fd.contour(alpha = c(10, 11, 12), prob = c(0.25, 0.25, 0.5), tau = 4)
Contour plot of the Gaussian mixture model in S^2.
mix.compnorm.contour(mod, type = "alr", n = 100, x = NULL, cont.line = FALSE)
mod |
An object containing the output of a mix.compnorm model. |
type |
The type of transformation used, either the additive log-ratio ("alr"), the isometric log-ratio ("ilr") or the pivot coordinate ("pivot") transformation. |
n |
The number of grid points to consider over which the density is calculated. |
x |
A matrix with the compositional data. |
cont.line |
Do you want the contour lines to appear? If yes, set this TRUE. |
The contour plot of a Gaussian mixture model is plotted. For this you need the (fitted) model.
A ternary plot with the data and the contour lines of the fitted Gaussian mixture model.
Michail Tsagris and Christos Adam.
R implementation and documentation: Michail Tsagris [email protected] and Christos Adam [email protected].
Ryan P. Browne, Aisha ElSherbiny and Paul D. McNicholas (2015). R package mixture: Mixture Models for Clustering and Classification
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
mix.compnorm, bic.mixcompnorm, diri.contour
x <- as.matrix(iris[, 1:3])
x <- x / rowSums(x)
mod <- mix.compnorm(x, 3, model = "EII")
mix.compnorm.contour(mod, "alr")
Contour plot of the generalised Dirichlet distribution in S^2.
gendiri.contour(a, b, n = 100, x = NULL, cont.line = FALSE)
a |
A vector with three elements corresponding to the 3 (estimated) shape parameter values. |
b |
A vector with three elements corresponding to the 3 (estimated) scale parameter values. |
n |
The number of grid points to consider over which the density is calculated. |
x |
This is either NULL (no data) or contains a 3 column matrix with compositional data. |
cont.line |
Do you want the contour lines to appear? If yes, set this TRUE. |
The user can plot only the contour lines of a Dirichlet with a given vector of parameters, or can also add the relevant data should he/she wish to.
A ternary diagram with the points and the Dirichlet contour lines.
Michail Tsagris and Christos Adam.
R implementation and documentation: Michail Tsagris [email protected] and Christos Adam [email protected].
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
diri.contour, mixdiri.contour, compnorm.contour,
comp.kerncontour, mix.compnorm.contour
x <- as.matrix(iris[, 1:3])
x <- x / rowSums(x)
gendiri.contour(a = c(3, 4, 2), b = c(1, 2, 3))
Contour plot of the kernel density estimate in S^2.
comp.kerncontour(x, type = "alr", n = 50, cont.line = FALSE)
x |
A matrix with the compositional data. It has to be a 3 column matrix. |
type |
This is either "alr" or "ilr", corresponding to the additive and the isometric log-ratio transformation respectively. |
n |
The number of grid points to consider, over which the density is calculated. |
cont.line |
Do you want the contour lines to appear? If yes, set this TRUE. |
The alr or the ilr transformation are applied to the compositional data. Then, the optimal bandwidth using maximum likelihood cross-validation is chosen. The multivariate normal kernel density is calculated for a grid of points. Those points are the points on the 2-dimensional simplex. Finally the contours are plotted.
A ternary diagram with the points and the kernel contour lines.
Michail Tsagris and Christos Adam.
R implementation and documentation: Michail Tsagris [email protected] and Christos Adam [email protected].
M.P. Wand and M.C. Jones (1995). Kernel Smoothing. CRC Press.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
diri.contour, mix.compnorm.contour, bivt.contour, compnorm.contour
x <- as.matrix(iris[, 1:3])
x <- x / rowSums(x)
comp.kerncontour(x, type = "alr", n = 20)
comp.kerncontour(x, type = "ilr", n = 20)
Contour plot of the normal distribution in S^2.
compnorm.contour(m, s, type = "alr", n = 100, x = NULL, cont.line = FALSE)
m |
The mean vector. |
s |
The covariance matrix. |
type |
The type of transformation used, either the additive log-ratio ("alr"), the isometric log-ratio ("ilr") or the pivot coordinate ("pivot") transformation. |
n |
The number of grid points to consider over which the density is calculated. |
x |
This is either NULL (no data) or contains a 3 column matrix with compositional data. |
cont.line |
Do you want the contour lines to appear? If yes, set this TRUE. |
The alr or the ilr transformation is applied to the compositional data at first. Then for a grid of points within the 2-dimensional simplex the bivariate normal density is calculated and the contours are plotted along with the points.
A ternary diagram with the points (if x is not NULL) and the bivariate normal contour lines.
Michail Tsagris and Christos Adam.
R implementation and documentation: Michail Tsagris [email protected] and Christos Adam [email protected].
diri.contour, mix.compnorm.contour, bivt.contour, skewnorm.contour
x <- as.matrix(iris[, 1:3])
x <- x / rowSums(x)
y <- Compositional::alr(x)
m <- colMeans(y)
s <- cov(y)
compnorm.contour(m, s)
Contour plot of the skew-normal distribution in S^2.
skewnorm.contour(x, type = "alr", n = 100, appear = TRUE, cont.line = FALSE)
x |
A matrix with the compositional data. It has to be a 3 column matrix. |
type |
This is either "alr" or "ilr", corresponding to the additive and the isometric log-ratio transformation respectively. |
n |
The number of grid points to consider over which the density is calculated. |
appear |
Should the available data appear on the ternary plot (TRUE) or not (FALSE)? |
cont.line |
Do you want the contour lines to appear? If yes, set this TRUE. |
The alr or the ilr transformation is applied to the compositional data at first. Then, for a grid of points within the 2-dimensional simplex, the bivariate skew-normal density is calculated and the contours are plotted along with the points.
A ternary diagram with the points (if appear = TRUE) and the bivariate skew-normal contour lines.
Michail Tsagris and Christos Adam.
R implementation and documentation: Michail Tsagris [email protected] and Christos Adam [email protected].
Azzalini A. and Valle A. D. (1996). The multivariate skew-normal distribution. Biometrika, 83(4): 715-726.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
diri.contour, mix.compnorm.contour, bivt.contour, compnorm.contour
x <- as.matrix(iris[51:100, 1:3])
x <- x / rowSums(x)
skewnorm.contour(x)
Contour plot of the t distribution in S^2.
bivt.contour(x, type = "alr", n = 100, appear = TRUE, cont.line = FALSE)
x |
A matrix with compositional data. It has to be a 3 column matrix. |
type |
This is either "alr" or "ilr", corresponding to the additive and the isometric log-ratio transformation respectively. |
n |
The number of grid points to consider over which the density is calculated. |
appear |
Should the available data appear on the ternary plot (TRUE) or not (FALSE)? |
cont.line |
Do you want the contour lines to appear? If yes, set this TRUE. |
The alr or the ilr transformation is applied to the compositional data at first and the location, scatter and degrees of freedom of the bivariate t distribution are computed. Then for a grid of points within the 2-dimensional simplex the bivariate t density is calculated and the contours are plotted along with the points.
A ternary diagram with the points (if appear = TRUE) and the bivariate t contour lines.
Michail Tsagris and Christos Adam.
R implementation and documentation: Michail Tsagris [email protected] and Christos Adam [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
diri.contour, mix.compnorm.contour, compnorm.contour, skewnorm.contour
x <- as.matrix(iris[, 1:3])
x <- x / rowSums(x)
bivt.contour(x)
bivt.contour(x, type = "ilr")
Cross validation for some compositional regression models.
cv.comp.reg(y, x, type = "comp.reg", nfolds = 10, folds = NULL, seed = NULL)
y |
A matrix with compositional data. Zero values are allowed for some regression models. |
x |
The predictor variable(s). |
type |
This can be one of the following: "comp.reg", "robust", "kl.compreg", "js.compreg", "diri.reg" or "zadr". |
nfolds |
The number of folds to be used. This is taken into consideration only if the folds argument is not supplied. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
seed |
If you supply a seed number, the results will always be the same. |
A k-fold cross validation for a compositional regression model is performed.
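The fold construction can be sketched as follows (an assumption about the general mechanism; the package may build its folds differently):

```r
# Sketch of forming k-fold cross-validation indices.
n <- 150; nfolds <- 10
set.seed(42)
folds <- split(sample(n), rep(1:nfolds, length.out = n))
# each fold is held out once; the model is fitted on the remaining folds
```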
A list including:
runtime |
The runtime of the cross-validation procedure. |
kl |
The Kullback-Leibler divergences for all runs. |
js |
The Jensen-Shannon divergences for all runs. |
perf |
The average Kullback-Leibler divergence and average Jensen-Shannon divergence. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
comp.reg, kl.compreg, compppr.tune, aknnreg.tune
y <- as.matrix( iris[, 1:3] )
y <- y / rowSums(y)
x <- iris[, 4]
mod <- cv.comp.reg(y, x)
α-k-NN regression with compositional predictor variables
Cross validation for the α-k-NN regression with compositional predictor variables.
alfaknnreg.tune(y, x, a = seq(-1, 1, by = 0.1), k = 2:10, nfolds = 10, apostasi = "euclidean", method = "average", folds = NULL, seed = NULL, graph = FALSE)
y |
The response variable, a numerical vector. |
x |
A matrix with the available compositional data. Zeros are allowed. |
a |
A vector with a grid of values of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0 the isometric log-ratio transformation is applied. |
k |
The number of nearest neighbours to consider. It can be a single number or a vector. |
nfolds |
The number of folds. Set to 10 by default. |
apostasi |
The type of distance to use, either "euclidean" or "manhattan". |
method |
If you want to take the average of the responses of the k closest observations, type "average". For the median, type "median" and for the harmonic mean, type "harmonic". |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
seed |
You can specify your own seed number here or leave it NULL. If a seed is supplied, the results will always be the same. |
graph |
If graph is TRUE, a filled contour plot will appear. |
A k-fold cross validation for the α-k-NN regression with compositional predictor variables is performed.
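The power transformation tuned via the a argument is the α-transformation of Tsagris, Preston and Wood (2011). A minimal Python sketch of its core formula (my own illustration; the package's implementation may differ in details such as a final isometric rotation):

```python
import numpy as np

def alpha_transform(x, a):
    """Power (alpha) transformation of a compositional vector x.
    For a != 0: z = (D * x^a / sum(x^a) - 1) / a, where D is the number of parts.
    As a -> 0 this tends to the centred log-ratio transformation."""
    x = np.asarray(x, dtype=float)
    D = x.size
    if a == 0:
        return np.log(x) - np.log(x).mean()   # centred log-ratio limit
    u = D * x**a / np.sum(x**a)
    return (u - 1.0) / a

x = np.array([0.2, 0.3, 0.5])
print(alpha_transform(x, 0.5))
print(alpha_transform(x, 0))
```

For any α the transformed components sum to zero, and small α values approach the centred log-ratio, which is why α must be strictly positive when zeros are present (the logarithm of zero is undefined).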
A list including:
mspe |
The mean square error of prediction. |
performance |
The minimum mean square error of prediction. |
opt_a |
The optimal value of α. |
opt_k |
The optimal value of k. |
runtime |
The runtime of the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M., Alenazi A. and Stewart C. (2023). Flexible non-parametric regression models for compositional response data with zeros. Statistics and Computing, 33(106).
https://link.springer.com/article/10.1007/s11222-023-10277-5
library(MASS)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
y <- fgl[, 1]
mod <- alfaknnreg.tune(y, x, a = seq(0.2, 0.4, by = 0.1), k = 2:4, nfolds = 5)
α-k-NN regression with compositional response data
Cross validation for the α-k-NN regression with compositional response data.
aknnreg.tune(y, x, a = seq(0.1, 1, by = 0.1), k = 2:10, apostasi = "euclidean", nfolds = 10, folds = NULL, seed = NULL, rann = FALSE)
y |
A matrix with the compositional response data. Zeros are allowed. |
x |
A matrix with the available predictor variables. |
a |
A vector with a grid of values of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0 the isometric log-ratio transformation is applied. |
k |
The number of nearest neighbours to consider. It can be a single number or a vector. |
apostasi |
The type of distance to use, either "euclidean" or "manhattan". |
nfolds |
The number of folds. Set to 10 by default. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
seed |
You can specify your own seed number here or leave it NULL. |
rann |
If you have large scale datasets and want a faster k-NN search, you can use the kd-trees implemented in the R package "Rnanoflann". In this case you must set this argument equal to TRUE. Note however that in this case the only available distance is the "euclidean". |
A k-fold cross validation for the α-k-NN regression for compositional response data is performed.
A list including:
kl |
The Kullback-Leibler divergence for all combinations of α and k. |
js |
The Jensen-Shannon divergence for all combinations of α and k. |
klmin |
The minimum Kullback-Leibler divergence. |
jsmin |
The minimum Jensen-Shannon divergence. |
kl.alpha |
The optimal value of α based on the Kullback-Leibler divergence. |
kl.k |
The optimal value of k based on the Kullback-Leibler divergence. |
js.alpha |
The optimal value of α based on the Jensen-Shannon divergence. |
js.k |
The optimal value of k based on the Jensen-Shannon divergence. |
runtime |
The runtime of the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M., Alenazi A. and Stewart C. (2023). Flexible non-parametric regression models for compositional response data with zeros. Statistics and Computing, 33(106).
https://link.springer.com/article/10.1007/s11222-023-10277-5
aknn.reg, akernreg.tune, akern.reg, alfa.rda, alfa.fda
y <- as.matrix( iris[, 1:3] )
y <- y / rowSums(y)
x <- iris[, 4]
mod <- aknnreg.tune(y, x, a = c(0.4, 0.6), k = 2:4, nfolds = 5)
α-kernel regression with compositional response data
Cross validation for the α-kernel regression with compositional response data.
akernreg.tune(y, x, a = seq(0.1, 1, by = 0.1), h = seq(0.1, 1, length = 10), type = "gauss", nfolds = 10, folds = NULL, seed = NULL)
y |
A matrix with the compositional response data. Zeros are allowed. |
x |
A matrix with the available predictor variables. |
a |
A vector with a grid of values of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0 the isometric log-ratio transformation is applied. |
h |
A vector with the bandwidth value(s) to consider. |
type |
The type of kernel to use, "gauss" or "laplace". |
nfolds |
The number of folds. Set to 10 by default. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
seed |
You can specify your own seed number here or leave it NULL. |
A k-fold cross validation for the α-kernel regression for compositional response data is performed.
A list including:
kl |
The Kullback-Leibler divergence for all combinations of α and h. |
js |
The Jensen-Shannon divergence for all combinations of α and h. |
klmin |
The minimum Kullback-Leibler divergence. |
jsmin |
The minimum Jensen-Shannon divergence. |
kl.alpha |
The optimal value of α based on the Kullback-Leibler divergence. |
kl.h |
The optimal bandwidth h based on the Kullback-Leibler divergence. |
js.alpha |
The optimal value of α based on the Jensen-Shannon divergence. |
js.h |
The optimal bandwidth h based on the Jensen-Shannon divergence. |
runtime |
The runtime of the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M., Alenazi A. and Stewart C. (2023). Flexible non-parametric regression models for compositional response data with zeros. Statistics and Computing, 33(106).
https://link.springer.com/article/10.1007/s11222-023-10277-5
akern.reg, aknnreg.tune, aknn.reg, alfa.rda, alfa.fda
y <- as.matrix( iris[, 1:3] )
y <- y / rowSums(y)
x <- iris[, 4]
mod <- akernreg.tune(y, x, a = c(0.4, 0.6), h = c(0.1, 0.2), nfolds = 5)
Cross validation for the kernel regression with Euclidean response data.
kernreg.tune(y, x, h = seq(0.1, 1, length = 10), type = "gauss", nfolds = 10, folds = NULL, seed = NULL, graph = FALSE, ncores = 1)
y |
A matrix or a vector with the Euclidean response. |
x |
A matrix with the available predictor variables. |
h |
A vector with the bandwidth value(s) to consider. |
type |
The type of kernel to use, "gauss" or "laplace". |
nfolds |
The number of folds. Set to 10 by default. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
seed |
You can specify your own seed number here or leave it NULL. |
graph |
If graph is TRUE, a plot will appear. |
ncores |
The number of cores to use. Default value is 1. |
A k-fold cross validation for the kernel regression with a Euclidean response is performed.
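The kernel regression tuned here is a locally weighted average of the responses; with a Gaussian kernel this is the Nadaraya-Watson estimator. A minimal Python sketch (illustrative only, not the package's code):

```python
import numpy as np

def kern_reg(xnew, x, y, h):
    """Nadaraya-Watson estimate with a Gaussian kernel and bandwidth h.
    x: (n, p) predictors, y: (n,) responses, xnew: (p,) query point."""
    x = np.atleast_2d(x)
    d2 = np.sum((x - xnew) ** 2, axis=1)   # squared Euclidean distances
    w = np.exp(-0.5 * d2 / h ** 2)         # Gaussian kernel weights
    return float(np.sum(w * y) / np.sum(w))

x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
print(kern_reg(np.array([1.5]), x, y, h=0.5))
```

A small bandwidth h makes the estimate follow the nearest observations closely, while a large h smooths towards the overall mean; the cross-validation scans the h grid for the best trade-off.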
A list including:
mspe |
The mean squared prediction error (MSPE) for each fold and each value of h. |
h |
The optimal value of h. |
performance |
The minimum MSPE. |
runtime |
The runtime of the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Wand M. P. and Jones M. C. (1994). Kernel smoothing. CRC press.
kern.reg, aknnreg.tune, aknn.reg
y <- iris[, 1]
x <- iris[, 2:4]
mod <- kernreg.tune(y, x, h = c(0.1, 0.2, 0.3) )
α-transformation
Cross validation for the regularised and flexible discriminant analysis with compositional data using the α-transformation.
alfarda.tune(x, ina, a = seq(-1, 1, by = 0.1), nfolds = 10, gam = seq(0, 1, by = 0.1), del = seq(0, 1, by = 0.1), ncores = 1, folds = NULL, stratified = TRUE, seed = NULL)
alfafda.tune(x, ina, a = seq(-1, 1, by = 0.1), nfolds = 10, folds = NULL, stratified = TRUE, seed = NULL, graph = FALSE)
x |
A matrix with the available compositional data. Zeros are allowed. |
ina |
A group indicator variable for the available data. |
a |
A vector with a grid of values of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0 the isometric log-ratio transformation is applied. |
nfolds |
The number of folds. Set to 10 by default. |
gam |
A vector of values between 0 and 1. It is the weight of the pooled covariance and the diagonal matrix. |
del |
A vector of values between 0 and 1. It is the weight of the LDA and QDA. |
ncores |
The number of cores to use. If it is more than 1, parallel computing is performed. It is advisable to use it if you have many observations and/or many variables; otherwise it will slow down the process. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
stratified |
Do you want the folds to be created in a stratified way? TRUE or FALSE. |
seed |
You can specify your own seed number here or leave it NULL. |
graph |
If graph is TRUE, a plot will appear. |
A k-fold cross validation is performed.
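As a rough guide to the gam and del arguments: in regularised discriminant analysis each class covariance matrix is blended with the pooled covariance matrix and with a scaled diagonal matrix. A hedged Python sketch of one common parameterisation (following Friedman's RDA; the exact parameterisation inside alfa.rda may differ):

```python
import numpy as np

def rda_cov(Sk, Spooled, gam, del_):
    """Regularised covariance for one class (sketch, after Friedman's RDA).
    del_ blends the class covariance (QDA, del_ = 1) with the pooled one (LDA, del_ = 0);
    gam then blends that matrix with a scaled identity (diagonal shrinkage)."""
    p = Sk.shape[0]
    S = del_ * Sk + (1.0 - del_) * Spooled
    return gam * S + (1.0 - gam) * (np.trace(S) / p) * np.eye(p)

Sk = np.diag([1.0, 2.0])   # class covariance
Sp = np.eye(2)             # pooled covariance
print(rda_cov(Sk, Sp, gam=0.5, del_=0.5))
```

With del_ = 1 and gam = 1 the class covariance is used unchanged (QDA); with del_ = 0 and gam = 1 the pooled covariance is used (LDA); gam = 0 shrinks all the way to a scaled identity.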
For the alfa.rda a list including:
res |
The estimated optimal rate and the best values of α, γ and δ. |
percent |
For the best value of α, the estimated percentages of correct classification for the values of γ and δ. |
se |
The estimated standard errors of the "percent" matrix. |
runtime |
The runtime of the cross-validation procedure. |
For the alfa.fda, a graph (if requested) with the estimated performance for each value of α and a list including:
per |
The performance of the fda in each fold for each value of α. |
performance |
The average performance for each value of α. |
opt_a |
The optimal value of α. |
runtime |
The runtime of the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Friedman Jerome, Trevor Hastie and Robert Tibshirani (2009). The elements of statistical learning, 2nd edition. Springer, Berlin
Tsagris M.T., Preston S. and Wood A.T.A. (2016). Improved supervised classification for compositional data using the α-transformation. Journal of Classification, 33(2): 243-261.
Hastie, Tibshirani and Buja (1994). Flexible Discriminant Analysis by Optimal Scoring. Journal of the American Statistical Association, 89(428): 1255-1270.
alfa.rda, alfanb.tune, cv.dda, compknn.tune, cv.compnb
library(MASS)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
ina <- fgl[, 10]
moda <- alfarda.tune(x, ina, a = seq(0.7, 1, by = 0.1), nfolds = 10, gam = seq(0.1, 0.3, by = 0.1), del = seq(0.1, 0.3, by = 0.1) )
Cross validation for the ridge regression is performed. There is an option for the GCV criterion which is automatic.
ridge.tune(y, x, nfolds = 10, lambda = seq(0, 2, by = 0.1), folds = NULL, ncores = 1, seed = NULL, graph = FALSE)
y |
A numeric vector containing the values of the target variable. If the values are proportions or percentages, i.e. strictly between 0 and 1, they are mapped onto the real line using the logit transformation. |
x |
A numeric matrix containing the variables. |
nfolds |
The number of folds in the cross validation. |
lambda |
A vector with a grid of values of λ, the ridge penalty parameter. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
ncores |
The number of cores to use. If it is more than 1 parallel computing is performed. |
seed |
You can specify your own seed number here or leave it NULL. |
graph |
If graph is set to TRUE, the performances for each fold as a function of λ will be plotted. |
A k-fold cross validation is performed. This function is used by alfaridge.tune.
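Ridge regression has a closed-form solution for every candidate λ, which is what makes scanning the λ grid cheap. A minimal Python sketch (illustrative; the package's internals may differ):

```python
import numpy as np

def ridge_beta(x, y, lam):
    """Ridge estimate (X'X + lam * I)^(-1) X'y; lam = 0 recovers least squares.
    In practice x is centred/standardised first, as ridge is not scale invariant."""
    p = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(p), x.T @ y)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
print(ridge_beta(x, y, lam=0.5))
```

Larger λ values shrink the coefficient vector towards zero, trading a little bias for lower variance; the cross-validation picks the λ minimising the prediction error.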
A list including:
msp |
The performance of the ridge regression for every fold. |
mspe |
The values of the mean prediction error for each value of λ. |
lambda |
The value of λ that leads to the minimum MSPE. |
performance |
The minimum MSPE. |
runtime |
The time required by the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Giorgos Athineou <[email protected]> and Michail Tsagris [email protected].
Hoerl A.E. and R.W. Kennard (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67.
Brown P. J. (1994). Measurement, Regression and Calibration. Oxford Science Publications.
y <- as.vector(iris[, 1])
x <- as.matrix(iris[, 2:4])
ridge.tune( y, x, nfolds = 10, lambda = seq(0, 2, by = 0.1), graph = TRUE )
α-transformation
Cross validation for the ridge regression is performed.
There is an option for the GCV criterion which is automatic. The predictor variables are compositional data and the α-transformation is applied first.
alfaridge.tune(y, x, nfolds = 10, a = seq(-1, 1, by = 0.1), lambda = seq(0, 2, by = 0.1), folds = NULL, ncores = 1, graph = TRUE, col.nu = 15, seed = NULL)
y |
A numeric vector containing the values of the target variable. If the values are proportions or percentages, i.e. strictly between 0 and 1, they are mapped onto the real line using the logit transformation. |
x |
A numeric matrix containing the compositional data, i.e. the predictor variables. Zero values are allowed. |
nfolds |
The number of folds in the cross validation. |
a |
A vector with a grid of values of α. |
lambda |
A vector with a grid of values of λ. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
ncores |
The number of cores to use. If it is more than 1, parallel computing is performed. It is advisable to use it if you have many observations and/or many variables; otherwise it will slow down the process. |
graph |
If graph is TRUE (default value) a filled contour plot will appear. |
col.nu |
A number parameter for the filled contour plot, taken into account only if graph is TRUE. |
seed |
You can specify your own seed number here or leave it NULL. |
A k-fold cross validation is performed.
If graph is TRUE a filled contour plot will appear. A list including:
mspe |
The MSPE where the rows correspond to the values of α and the columns to the values of λ. |
best.par |
The best pair of α and λ. |
performance |
The minimum mean squared error of prediction. |
runtime |
The run time of the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Giorgos Athineou <[email protected]> and Michail Tsagris [email protected].
Hoerl A.E. and R.W. Kennard (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67.
Brown P. J. (1994). Measurement, Regression and Calibration. Oxford Science Publications.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
library(MASS)
y <- as.vector(fgl[, 1])
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
alfaridge.tune( y, x, nfolds = 10, a = seq(0.1, 1, by = 0.1), lambda = seq(0, 1, by = 0.1) )
Cross validation for the TFLR model.
cv.tflr(y, x, nfolds = 10, folds = NULL, seed = NULL)
y |
A matrix with compositional response data. Zero values are allowed. |
x |
A matrix with compositional predictors. Zero values are allowed. |
nfolds |
The number of folds to be used. This is taken into consideration only if the folds argument is not supplied. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
seed |
You can specify your own seed number here or leave it NULL. If a seed is supplied, the results will always be the same. |
A k-fold cross validation for the transformation-free linear regression for compositional responses and predictors is performed.
A list including:
runtime |
The runtime of the cross-validation procedure. |
kl |
The Kullback-Leibler divergences for all runs. |
js |
The Jensen-Shannon divergences for all runs. |
perf |
The average Kullback-Leibler divergence and average Jensen-Shannon divergence. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Fiksel J., Zeger S. and Datta A. (2022). A transformation-free linear regression for compositional outcomes and predictors. Biometrics, 78(3): 974–987.
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
library(MASS)
y <- rdiri(100, runif(3, 1, 3))
x <- as.matrix(fgl[1:100, 2:9])
x <- x / rowSums(x)
mod <- cv.tflr(y, x)
mod
α-transformation
Cross-validation for LASSO with compositional predictors using the α-transformation.
alfalasso.tune(y, x, a = seq(-1, 1, by = 0.1), model = "gaussian", lambda = NULL, type.measure = "mse", nfolds = 10, folds = NULL, stratified = FALSE)
y |
A numerical vector or a matrix for multinomial logistic regression. |
x |
A numerical matrix containing the predictor variables, compositional data, where zero values are allowed. |
a |
A vector with a grid of values of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0 the isometric log-ratio transformation is applied. |
model |
The type of the regression model, "gaussian", "binomial", "poisson", "multinomial", or "mgaussian". |
lambda |
This information is copied from the package glmnet. A user supplied lambda sequence. Typical usage is to have the program compute its own lambda sequence based on nlambda and lambda.min.ratio. Supplying a value of lambda overrides this. WARNING: use with care. Avoid supplying a single value for lambda (for predictions after CV use predict() instead). Supply instead a decreasing sequence of lambda values. glmnet relies on its warm starts for speed, and it is often faster to fit a whole path than compute a single fit. |
type.measure |
This information is taken from the package glmnet. The loss function to use for cross-validation. For gaussian models this can be "mse"; "deviance" applies to logistic and poisson regression; "class" applies to binomial and multinomial logistic regression only and gives the misclassification error; "auc" is for two-class logistic regression only and gives the area under the ROC curve; "mse" or "mae" (mean absolute error) can be used by all models. |
nfolds |
The number of folds. Set to 10 by default. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
stratified |
Do you want the folds to be created in a stratified way? TRUE or FALSE. |
The function uses the glmnet package to perform LASSO penalised regression. For more details see the function in that package.
A matrix with two columns and with the number of rows equal to the number of α values used. Each row contains the optimal value of the LASSO penalty parameter λ and the corresponding optimal value of the loss function, for a given value of α.
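glmnet fits the LASSO path by coordinate descent, whose elementary update is the soft-thresholding operator. A minimal sketch of that operator (illustrative only, not glmnet's actual code):

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0),
    the building block of the coordinate-descent update used for the LASSO.
    Coefficients whose magnitude is below lam are set exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(soft_threshold(np.array([-3.0, -0.5, 0.5, 3.0]), 1.0))
```

This is why the LASSO produces exact zeros (variable selection), unlike the ridge penalty, which only shrinks coefficients.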
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Friedman, J., Hastie, T. and Tibshirani, R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Vol. 33(1), 1–22.
alfa.lasso, cv.lasso.klcompreg, lasso.compreg, alfa.knn.reg
y <- iris[, 1]
x <- rdiri(150, runif(20, 2, 5) )
mod <- alfalasso.tune( y, x, a = c(0.2, 0.5, 1) )
α-SCLS model
Cross-validation for the α-SCLS model.
cv.ascls(y, x, a = seq(0.1, 1, by = 0.1), nfolds = 10, folds = NULL, seed = NULL)
y |
A numerical matrix with the simplicial response data. Zero values are allowed. |
x |
A matrix with the simplicial predictor variables. Zero values are allowed. |
a |
A vector of values, or a single value, of the α parameter. |
nfolds |
The number of folds for the K-fold cross validation, set to 10 by default. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
seed |
You can specify your own seed number here or leave it NULL. |
The K-fold cross validation is performed in order to select the optimal value of α for the α-SCLS model.
A list including:
runtime |
The runtime of the cross-validation procedure. |
kl |
The Kullback-Leibler divergence for every value of α. |
js |
The Jensen-Shannon divergence for every value of α. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
library(MASS)
y <- rdiri( 214, runif(4, 1, 3) )
x <- as.matrix( fgl[, 2:9] )
mod <- cv.ascls(y, x, nfolds = 5)
α-TFLR model
Cross-validation for the α-TFLR model.
cv.atflr(y, x, a = seq(0.1, 1, by = 0.1), nfolds = 10, folds = NULL, seed = NULL)
y |
A numerical matrix with the simplicial response data. Zero values are allowed. |
x |
A matrix with the simplicial predictor variables. Zero values are allowed. |
a |
A vector of values, or a single value, of the α parameter. |
nfolds |
The number of folds for the K-fold cross validation, set to 10 by default. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
seed |
You can specify your own seed number here or leave it NULL. |
The K-fold cross validation is performed in order to select the optimal value of α for the α-TFLR model.
A list including:
runtime |
The runtime of the cross-validation procedure. |
kl |
The Kullback-Leibler divergence for every value of α. |
js |
The Jensen-Shannon divergence for every value of α. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Fiksel J., Zeger S. and Datta A. (2022). A transformation-free linear regression for compositional outcomes and predictors. Biometrics, 78(3): 974–987.
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
library(MASS)
y <- rdiri( 214, runif(4, 1, 3) )
x <- as.matrix( fgl[, 2:9] )
mod <- cv.atflr(y, x, nfolds = 2, a = c(0.5, 1))
Cross-validation for the Dirichlet discriminant analysis.
cv.dda(x, ina, nfolds = 10, folds = NULL, stratified = TRUE, seed = NULL)
x |
A matrix with the available data, the predictor variables. |
ina |
A vector of data. The response variable, which is categorical (factor is acceptable). |
folds |
A list with the indices of the folds. |
nfolds |
The number of folds to be used. This is taken into consideration only if "folds" is NULL. |
stratified |
Do you want the folds to be selected using stratified random sampling? This preserves the proportions of the groups in each fold. Make this TRUE if you wish. |
seed |
If you supply a seed number, the same folds will be created every time. |
This function estimates the performance of the Dirichlet discriminant analysis via k-fold cross-validation.
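Stratified fold creation, as controlled by the stratified argument, allocates each class to the folds separately so that every fold keeps roughly the class proportions of the full sample. A Python sketch of one simple scheme (illustrative, not the package's code):

```python
import numpy as np

def stratified_folds(ina, nfolds, seed=None):
    """Split indices into nfolds folds, allocating each class separately so that
    every fold keeps roughly the class proportions of the full sample."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(nfolds)]
    for cls in np.unique(ina):
        idx = rng.permutation(np.flatnonzero(ina == cls))
        for j, i in enumerate(idx):
            folds[j % nfolds].append(int(i))   # deal shuffled indices round-robin
    return folds

ina = np.array([0] * 20 + [1] * 10)   # imbalanced two-class labels
folds = stratified_folds(ina, nfolds=5, seed=42)
print([sorted(f) for f in folds])
```

With 20 and 10 observations in the two classes, each of the 5 folds receives 4 and 2 of them respectively, so the 2:1 class ratio is preserved in every fold.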
A list including:
percent |
The percentage of correct classification |
runtime |
The duration of the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman J., Hastie T. and Tibshirani R. (2017). The elements of statistical learning. New York: Springer.
Thomas P. Minka (2003). Estimating a Dirichlet distribution. http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf
dda, alfanb.tune, alfarda.tune, compknn.tune, cv.compnb
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
mod <- cv.dda(x, ina = iris[, 5] )
Cross-validation for the LASSO Kullback-Leibler divergence based regression.
cv.lasso.klcompreg(y, x, alpha = 1, type = "grouped", nfolds = 10, folds = NULL, seed = NULL, graph = FALSE)
y |
A numerical matrix with compositional data with or without zeros. |
x |
A matrix with the predictor variables. |
alpha |
The elastic net mixing parameter, with 0 ≤ α ≤ 1; α = 1 gives the LASSO penalty and α = 0 the ridge penalty. |
type |
This information is copied from the package glmnet. If "grouped", then a grouped lasso penalty is used on the multinomial coefficients for a variable. This ensures they are all in or out together. The default in our case is "grouped". |
nfolds |
The number of folds for the K-fold cross validation, set to 10 by default. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
seed |
You can specify your own seed number here or leave it NULL. |
graph |
If graph is TRUE, the plot of the cross-validated object will appear. |
The K-fold cross validation is performed in order to select the optimal value of λ, the penalty parameter in the LASSO.
The outcome is the same as in the R package glmnet. The extra addition is that if "graph = TRUE", the plot of the cross-validated object is returned. The x-axis contains the logarithm of λ and the y-axis the deviance. The numbers on top of the figure show the number of sets of coefficients for each component that are non-zero.
Michail Tsagris and Abdulaziz Alenazi.
R implementation and documentation: Michail Tsagris [email protected] and Abdulaziz Alenazi [email protected].
Alenazi, A. A. (2022). f-divergence regression models for compositional data. Pakistan Journal of Statistics and Operation Research, 18(4): 867–882.
Friedman, J., Hastie, T. and Tibshirani, R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Vol. 33(1), 1-22.
lasso.klcompreg, lassocoef.plot, lasso.compreg, cv.lasso.compreg, kl.compreg
library(MASS)
y <- rdiri( 214, runif(4, 1, 3) )
x <- as.matrix( fgl[, 2:9] )
mod <- cv.lasso.klcompreg(y, x)
Cross-validation for the LASSO log-ratio regression with compositional response.
cv.lasso.compreg(y, x, alpha = 1, nfolds = 10, folds = NULL, seed = NULL, graph = FALSE)
y |
A numerical matrix with compositional data. Zero values are not allowed, as the additive log-ratio (alr) transformation is applied first. |
x |
A matrix with the predictor variables. |
alpha |
The elastic net mixing parameter, with 0 ≤ α ≤ 1; α = 1 gives the LASSO penalty and α = 0 the ridge penalty. |
nfolds |
The number of folds for the K-fold cross validation, set to 10 by default. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
seed |
You can specify your own seed number here or leave it NULL. |
graph |
If graph is TRUE, the plot of the cross-validated object will appear. |
The K-fold cross validation is performed in order to select the optimal value of λ, the penalty parameter in the LASSO.
The outcome is the same as in the R package glmnet. The extra addition is that if "graph = TRUE", the plot of the cross-validated object is returned. The x-axis contains the logarithm of λ and the y-axis the mean squared error. The numbers on top of the figure show the number of sets of coefficients for each component that are non-zero.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Friedman, J., Hastie, T. and Tibshirani, R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Vol. 33(1), 1-22.
lasso.compreg, lasso.klcompreg, lassocoef.plot, cv.lasso.klcompreg, comp.reg
library(MASS)
y <- rdiri( 214, runif(4, 1, 3) )
x <- as.matrix( fgl[, 2:9] )
mod <- cv.lasso.compreg(y, x)
Cross-validation for the naive Bayes classifiers for compositional data.
cv.compnb(x, ina, type = "beta", folds = NULL, nfolds = 10, stratified = TRUE, seed = NULL, pred.ret = FALSE)
x |
A matrix with the available data, the predictor variables. |
ina |
A vector of data. The response variable, which is categorical (factor is acceptable). |
type |
The type of naive Bayes, "beta", "logitnorm", "cauchy", "laplace", "gamma", "normlog" or "weibull". For the last 4 distributions, the negative of the logarithm of the compositional data is applied first. |
folds |
A list with the indices of the folds. |
nfolds |
The number of folds to be used. This is taken into consideration only if "folds" is NULL. |
stratified |
Do you want the folds to be selected using stratified random sampling? This preserves the proportions of the groups in each fold. Make this TRUE if you wish. |
seed |
You can specify your own seed number here or leave it NULL. |
pred.ret |
If you want the predicted values returned set this to TRUE. |
A list including:
preds |
If pred.ret is TRUE the predicted values for each fold are returned as elements in a list. |
crit |
The estimated accuracy metric, i.e. the percentage of correct classification. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman J., Hastie T. and Tibshirani R. (2017). The elements of statistical learning. New York: Springer.
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
mod <- cv.compnb(x, ina = iris[, 5] )
The alpha-transformation
Cross-validation for the naive Bayes classifiers for compositional data using the alpha-transformation.
alfanb.tune(x, ina, a = seq(-1, 1, by = 0.1), type = "gaussian", folds = NULL, nfolds = 10, stratified = TRUE, seed = NULL)
x |
A matrix with the available data, the predictor variables. |
ina |
A vector of data. The response variable, which is categorical (factor is acceptable). |
a |
A vector with the values of the power transformation parameter alpha to be evaluated; each value must be between -1 and 1. |
type |
The type of naive Bayes, "gaussian", "cauchy" or "laplace". |
folds |
A list with the indices of the folds. |
nfolds |
The number of folds to be used. This is taken into consideration only if "folds" is NULL. |
stratified |
Should the folds be selected using stratified random sampling? This preserves the proportions of the groups across the folds. Set this to TRUE if you wish. |
seed |
You can specify your own seed number here or leave it NULL. |
This function estimates the performance of the naive Bayes classifier for each value of alpha of the alpha-transformation.
A list including:
crit |
A vector whose length is equal to the number of values of alpha, containing the accuracy metric for each value. For the classification case this is the percentage of correct classification. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman J., Hastie T. and Tibshirani R. (2017). The elements of statistical learning. New York: Springer.
alfa.nb, alfarda.tune, compknn.tune, cv.dda, cv.compnb
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
mod <- alfanb.tune(x, ina = iris[, 5], a = c(0, 0.1, 0.2) )
Cross-validation for the SCLS model.
cv.scls(y, x, nfolds = 10, folds = NULL, seed = NULL)
y |
A matrix with compositional response data. Zero values are allowed. |
x |
A matrix with compositional predictors. Zero values are allowed. |
nfolds |
The number of folds to be used. This is taken into consideration only if the folds argument is not supplied. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
seed |
You can specify your own seed number here or leave it NULL. |
The function performs k-fold cross-validation for the least squares regression where the beta coefficients are constrained to be positive and to sum to 1.
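The two divergences reported as performance metrics can be sketched directly in base R (illustrative definitions under one common scaling convention; `kl_perf` and `js_perf` are hypothetical names, not the package's internals):

```r
# Kullback-Leibler and Jensen-Shannon divergence between observed
# compositions y and fitted compositions est, summed over all cells.
kl_perf <- function(y, est) sum(y * log(y / est), na.rm = TRUE)

js_perf <- function(y, est) {
  m <- 0.5 * (y + est)                # the mid-point composition
  0.5 * kl_perf(y, m) + 0.5 * kl_perf(est, m)
}

y <- matrix(c(0.2, 0.8, 0.5, 0.5), ncol = 2, byrow = TRUE)
```

Both metrics are zero when the fitted compositions equal the observed ones and positive otherwise, which is why smaller averages indicate a better model.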
A list including:
runtime |
The runtime of the cross-validation procedure. |
kl |
The Kullback-Leibler divergences for all runs. |
js |
The Jensen-Shannon divergences for all runs. |
perf |
The average Kullback-Leibler divergence and average Jensen-Shannon divergence. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris. M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
library(MASS)
set.seed(1234)
y <- rdiri(214, runif(3, 1, 3))
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
mod <- cv.scls(y, x, nfolds = 5, seed = 12345)
mod
Cross-validation for the SCRQ model.
cv.scrq(y, x, nfolds = 10, folds = NULL, seed = NULL)
y |
A matrix with compositional response data. Zero values are allowed. |
x |
A matrix with compositional predictors. Zero values are allowed. |
nfolds |
The number of folds to be used. This is taken into consideration only if the folds argument is not supplied. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
seed |
You can specify your own seed number here or leave it NULL. |
The function performs k-fold cross-validation for the least absolute deviations regression where the beta coefficients are constrained to be positive and to sum to 1.
A list including:
runtime |
The runtime of the cross-validation procedure. |
kl |
The Kullback-Leibler divergences for all runs. |
js |
The Jensen-Shannon divergences for all runs. |
perf |
The average Kullback-Leibler divergence and average Jensen-Shannon divergence. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris. M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
library(MASS)
set.seed(1234)
y <- rdiri(214, runif(3, 1, 3))
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
mod <- cv.scrq(y, x, nfolds = 5, seed = 12345)
mod
Density of compositional data from Gaussian mixture models.
dmix.compnorm(x, mu, sigma, prob, type = "alr", logged = TRUE)
x |
A vector or a matrix with compositional data. |
prob |
A vector with mixing probabilities. Its length is equal to the number of clusters. |
mu |
A matrix where each row corresponds to the mean vector of each cluster. |
sigma |
An array consisting of the covariance matrix of each cluster. |
type |
The type of transformation used, either the additive log-ratio ("alr"), the isometric log-ratio ("ilr") or the pivot coordinate ("pivot") transformation. |
logged |
A boolean variable specifying whether the logarithm of the density values to be returned. It is set to TRUE by default. |
The density of the compositional data under a multivariate Gaussian mixture model is computed, after applying the selected log-ratio transformation.
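The mixture-density computation on alr-transformed data can be sketched in base R (an illustrative version: the log-Jacobian of the transformation is omitted for simplicity, and `dmix_alr_sketch` is a hypothetical name, not the package function):

```r
# Log-density of a Gaussian mixture evaluated on alr-transformed
# compositions: transform, then sum the weighted component densities.
dmix_alr_sketch <- function(x, mu, sigma, prob) {
  y <- log(x[, -ncol(x), drop = FALSE] / x[, ncol(x)])   # additive log-ratio
  g <- length(prob)
  dens <- matrix(0, nrow(y), g)
  for (j in 1:g) {
    centred <- sweep(y, 2, mu[j, ])
    s <- sigma[, , j]
    quad <- rowSums((centred %*% solve(s)) * centred)     # Mahalanobis part
    dens[, j] <- prob[j] * exp(-0.5 * quad) /
      sqrt((2 * pi)^ncol(y) * det(s))
  }
  log(rowSums(dens))
}

# One cluster with zero mean and identity covariance: the barycentre
# (1/3, 1/3, 1/3) maps to the alr origin.
x0 <- matrix(1/3, 1, 3)
mu0 <- matrix(0, 1, 2)
s0 <- array(diag(2), c(2, 2, 1))
ld <- dmix_alr_sketch(x0, mu0, s0, 1)
```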
A vector with the density values.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ryan P. Browne, Aisha ElSherbiny and Paul D. McNicholas (2015). R package mixture: Mixture Models for Clustering and Classification.
p <- c(1/3, 1/3, 1/3)
mu <- matrix(nrow = 3, ncol = 4)
s <- array( dim = c(4, 4, 3) )
x <- as.matrix(iris[, 1:4])
ina <- as.numeric(iris[, 5])
mu <- rowsum(x, ina) / 50
s[, , 1] <- cov(x[ina == 1, ])
s[, , 2] <- cov(x[ina == 2, ])
s[, , 3] <- cov(x[ina == 3, ])
y <- rmixcomp(100, p, mu, s, type = "alr")$x
mod <- dmix.compnorm(y, mu, s, p)
Density of the Flexible Dirichlet distribution
dfd(x, alpha, prob, tau)
x |
A vector or a matrix with compositional data. |
alpha |
A vector of the non-negative alpha parameters of the Flexible Dirichlet distribution. |
prob |
A vector of the clusters' probabilities. It must sum to one. |
tau |
The non-negative scalar tau parameter of the Flexible Dirichlet distribution. |
For more information see the references and the package FlexDir.
The density value(s).
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ongaro A. and Migliorati S. (2013). A generalization of the Dirichlet distribution. Journal of Multivariate Analysis, 114, 412–426.
Migliorati S., Ongaro A. and Monti G. S. (2017). A structured Dirichlet mixture model for compositional data: inferential and applicative issues. Statistics and Computing, 27, 963–983.
alpha <- c(12, 11, 10)
prob <- c(0.25, 0.25, 0.5)
tau <- 8
x <- rfd(20, alpha, prob, tau)
dfd(x, alpha, prob, tau)
Density of the folded model normal distribution.
dfolded(x, a, p, mu, su, logged = TRUE)
x |
A vector or a matrix with compositional data. No zeros are allowed. |
a |
The value of the power transformation parameter alpha. |
p |
The probability inside the simplex of the folded model. |
mu |
The mean vector. |
su |
The covariance matrix. |
logged |
A boolean variable specifying whether the logarithm of the density values to be returned. It is set to TRUE by default. |
Density values of the folded model.
The density value(s).
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. and Stewart C. (2020). A folded model for compositional data analysis. Australian and New Zealand Journal of Statistics, 62(2): 249-277. https://arxiv.org/pdf/1802.07330.pdf
rfolded, a.est, folded.contour
s <- c(0.1490676523, -0.4580818209, 0.0020395316, -0.0047446076,
       -0.4580818209, 1.5227259250, 0.0002596411, 0.0074836251,
       0.0020395316, 0.0002596411, 0.0365384838, -0.0471448849,
       -0.0047446076, 0.0074836251, -0.0471448849, 0.0611442781)
s <- matrix(s, ncol = 4)
m <- c(1.715, 0.914, 0.115, 0.167)
x <- rfolded(100, m, s, 0.5)
mod <- a.est(x)
den <- dfolded(x, mod$best, mod$p, mod$mu, mod$su)
Density values of a Dirichlet distribution.
ddiri(x, a, logged = TRUE)
x |
A matrix containing compositional data. This can be a vector or a matrix with the data. |
a |
A vector of parameters. Its length must be equal to the number of components, or columns of the matrix with the compositional data and all values must be greater than zero. |
logged |
A boolean variable specifying whether the logarithm of the density values to be returned. It is set to TRUE by default. |
The density of the Dirichlet distribution for a vector or a matrix of compositional data is returned.
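The Dirichlet log-density behind these values is short enough to sketch directly (a minimal base-R version; `ddiri_sketch` is an illustrative name, not the package function):

```r
# Dirichlet log-density, vectorized over the rows of x:
# log f(x; a) = log G(sum a) - sum log G(a_j) + sum (a_j - 1) log x_j.
ddiri_sketch <- function(x, a, logged = TRUE) {
  x <- matrix(x, ncol = length(a))
  con <- lgamma(sum(a)) - sum(lgamma(a))    # normalizing constant
  f <- con + as.vector(log(x) %*% (a - 1))
  if (logged) f else exp(f)
}

x1 <- c(0.2, 0.3, 0.5)
```

With a = (1, 1, 1) the Dirichlet is uniform on the simplex, so the density is Gamma(3) = 2 at every point, a convenient sanity check.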
A vector with the density values.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
dgendiri, diri.nr, diri.est, diri.contour, rdiri, dda
x <- rdiri( 100, c(5, 7, 4, 8, 10, 6, 4) )
a <- diri.est(x)
f <- ddiri(x, a$param)
sum(f)
a
Density values of a generalised Dirichlet distribution.
dgendiri(x, a, b, logged = TRUE)
x |
A matrix containing compositional data. This can be a vector or a matrix with the data. |
a |
A numerical vector with the shape parameter values of the Gamma distribution. |
b |
A numerical vector with the scale parameter values of the Gamma distribution. |
logged |
A boolean variable specifying whether the logarithm of the density values to be returned. It is set to TRUE by default. |
The density of the generalised Dirichlet distribution for a vector or a matrix of compositional data is returned.
A vector with the density values.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
ddiri, rgendiri, diri.est, diri.contour, rdiri, dda
a <- c(1, 2, 3)
b <- c(2, 3, 4)
x <- rgendiri(100, a, b)
y <- dgendiri(x, a, b)
Density values of a mixture of Dirichlet distributions.
dmixdiri(x, a, prob, logged = TRUE)
x |
A vector or a matrix with compositional data. Zeros are not allowed. |
a |
A matrix where each row contains the parameters of each Dirichlet component. |
prob |
A vector with the mixing probabilities. |
logged |
A boolean variable specifying whether the logarithm of the density values to be returned. It is set to TRUE by default. |
The density of the mixture of Dirichlet distribution for a vector or a matrix of compositional data is returned.
A vector with the density values.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ye X., Yu Y. K. and Altschul S. F. (2011). On the inference of Dirichlet mixture priors for protein sequence comparison. Journal of Computational Biology, 18(8), 941-954.
a <- matrix( c(12, 30, 45, 32, 50, 16), byrow = TRUE, ncol = 3 )
prob <- c(0.5, 0.5)
x <- rmixdiri(100, a, prob)$x
f <- dmixdiri(x, a, prob)
Dirichlet discriminant analysis.
dda(xnew, x, ina)
xnew |
A matrix with the new compositional predictor data whose class you want to predict. Zeros are allowed. |
x |
A matrix with the available compositional predictor data. Zeros are allowed. |
ina |
A vector of data. The response variable, which is categorical (factor is acceptable). |
The function performs maximum likelihood discriminant analysis using the Dirichlet distribution.
A vector with the estimated group.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman J., Hastie T. and Tibshirani R. (2017). The elements of statistical learning. New York: Springer.
Thomas P. Minka (2003). Estimating a Dirichlet distribution. http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
cv.dda, comp.nb, alfa.rda, alfa.knn, comp.knn, mix.compnorm, diri.reg, zadr
x <- Compositional::rdiri(100, runif(5) )
ina <- rbinom(100, 1, 0.5) + 1
mod <- dda(x, x, ina )
Dirichlet random values simulation.
rdiri(n, a)
n |
The sample size, a numerical value. |
a |
A numerical vector with the parameter values. |
The algorithm is straightforward, for each vector, independent gamma values are generated and then divided by their total sum.
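The gamma construction described above fits in a few lines of base R (an illustrative sketch; `rdiri_sketch` is a hypothetical name, not the package function):

```r
# Dirichlet simulation: draw independent Gamma(a_j, 1) values and
# divide each vector by its total sum; every row is one composition.
rdiri_sketch <- function(n, a) {
  x <- matrix(rgamma(n * length(a), shape = rep(a, each = n)), nrow = n)
  x / rowSums(x)
}

set.seed(1)
x <- rdiri_sketch(100, c(5, 7, 4))
```

Each row is strictly positive and sums to one, so the simulated values lie on the simplex by construction.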
A matrix with the simulated data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
diri.est, diri.nr, diri.contour, rgendiri
x <- rdiri( 100, c(5, 7, 1, 3, 10, 2, 4) )
diri.est(x)
Dirichlet regression.
diri.reg(y, x, plot = FALSE, xnew = NULL)
diri.reg2(y, x, xnew = NULL)
diri.reg3(y, x, xnew = NULL)
y |
A matrix with the compositional data (dependent variable). Zero values are not allowed. |
x |
The predictor variable(s), they can be either continuous or categorical or both. |
plot |
A boolean variable specifying whether to plot the leverage values of the observations or not. This is taken into account only when xnew = NULL. |
xnew |
If you have new data use it, otherwise leave it NULL. |
A Dirichlet distribution is assumed for the regression. This involves numerical optimization.
The function "diri.reg2()" allows the covariates to be linked with the precision parameter \(\phi\) via the exponential link function \(\phi_i = e^{\mathbf{x}_i^\top \boldsymbol{\beta}_\phi}\). The function "diri.reg3()" links the covariates to the alpha parameters of the Dirichlet distribution, i.e. it uses the classical parametrization of the distribution. This means that there is a set of regression parameters for each component.
A list including:
runtime |
The time required by the regression. |
loglik |
The value of the log-likelihood. |
phi |
The precision parameter. If covariates are linked with it (function "diri.reg2()"), this will be a vector. |
phipar |
The coefficients of the phi parameter if it is linked to the covariates. |
std.phi |
The standard errors of the coefficients of the phi parameter if it is linked to the covariates. |
log.phi |
The logarithm of the precision parameter. |
std.logphi |
The standard error of the logarithm of the precision parameter. |
be |
The beta coefficients. |
seb |
The standard error of the beta coefficients. |
sigma |
The covariance matrix of the regression parameters (for the mean vector and the phi parameter). |
lev |
The leverage values. |
est |
For the "diri.reg" this contains the fitted or the predicted values (if xnew is not NULL). For the "diri.reg2" if xnew is NULL, this is also NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Maier, Marco J. (2014) DirichletReg: Dirichlet Regression for Compositional Data in R. Research Report Series/Department of Statistics and Mathematics, 125. WU Vienna University of Economics and Business, Vienna. http://epub.wu.ac.at/4077/1/Report125.pdf
Gueorguieva, Ralitza, Robert Rosenheck, and Daniel Zelterman (2008). Dirichlet component regression and its applications to psychiatric data. Computational statistics & data analysis 52(12): 5344-5355.
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
js.compreg, kl.compreg, ols.compreg, comp.reg, alfa.reg, diri.nr, dda
x <- as.vector(iris[, 4])
y <- as.matrix(iris[, 1:3])
y <- y / rowSums(y)
mod1 <- diri.reg(y, x)
mod2 <- diri.reg2(y, x)
mod3 <- comp.reg(y, x)
Distance based regression models for proportions.
ols.prop.reg(y, x, cov = FALSE, tol = 1e-07, maxiters = 100)
helling.prop.reg(y, x, tol = 1e-07, maxiters = 100)
y |
A numerical vector with proportions. Zeros and ones are allowed. |
x |
A matrix or a data frame with the predictor variables. |
cov |
Should the covariance matrix be returned? TRUE or FALSE. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. This is set to 1e-07 by default. |
maxiters |
The maximum number of iterations before the Newton-Raphson is terminated automatically. |
We use the Newton-Raphson algorithm, but unlike R's built-in "glm" function we perform no checks and no extra calculations; we simply fit the model. The functions also accept binary responses (0 or 1).
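The least-squares idea with a logistic mean can be sketched with optim() standing in for the Newton-Raphson solver (an illustrative substitute, not the package's algorithm; `prop_ls` is a hypothetical name):

```r
# Least-squares regression for proportions with a logistic mean:
# minimize sum( (y - 1 / (1 + exp(-X b)))^2 ) over the coefficients b.
prop_ls <- function(y, x) {
  X <- cbind(1, x)                              # add the constant term
  sse <- function(b) sum((y - plogis(X %*% b))^2)
  optim(numeric(ncol(X)), sse, method = "BFGS")
}

set.seed(4)
x <- matrix(rnorm(200), ncol = 2)
y <- rbeta(100, 1, 4)
fit <- prop_ls(y, x)                            # fit$par holds the coefficients
```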
A list including:
sse |
The sum of squares of errors for the "ols.prop.reg" function. |
be |
The estimated regression coefficients. |
seb |
The standard error of the regression coefficients if "cov" is TRUE. |
covb |
The covariance matrix of the regression coefficients in "ols.prop.reg" if "cov" is TRUE. |
H |
The Hellinger distance between the true and the observed proportions in "helling.prop.reg". |
iters |
The number of iterations required by the Newton-Raphson. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Papke L. E. & Wooldridge J. (1996). Econometric methods for fractional response variables with an application to 401(K) plan participation rates. Journal of Applied Econometrics, 11(6): 619–632.
McCullagh, Peter, and John A. Nelder. Generalized linear models. CRC press, USA, 2nd edition, 1989.
y <- rbeta(100, 1, 4)
x <- matrix(rnorm(100 * 2), ncol = 2)
a1 <- ols.prop.reg(y, x)
a2 <- helling.prop.reg(y, x)
Regression for compositional data based on the Kullback-Leibler, the Jensen-Shannon and the symmetric Kullback-Leibler divergence.
kl.compreg(y, x, con = TRUE, B = 1, ncores = 1, xnew = NULL, tol = 1e-07, maxiters = 50)
js.compreg(y, x, con = TRUE, B = 1, ncores = 1, xnew = NULL)
tv.compreg(y, x, con = TRUE, B = 1, ncores = 1, xnew = NULL)
symkl.compreg(y, x, con = TRUE, B = 1, ncores = 1, xnew = NULL)
hellinger.compreg(y, x, con = TRUE, B = 1, ncores = 1, xnew = NULL)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. |
x |
The predictor variable(s), they can be either continuous or categorical or both. |
con |
If this is TRUE (default) then the constant term is estimated, otherwise the model includes no constant term. |
B |
If B is greater than 1 bootstrap estimates of the standard error are returned. If B=1, no standard errors are returned. |
ncores |
If ncores is 2 or more parallel computing is performed. This is to be used for the case of bootstrap. If B=1, this is not taken into consideration. |
xnew |
If you have new data use it, otherwise leave it NULL. |
tol |
The tolerance value to terminate the Newton-Raphson procedure. |
maxiters |
The maximum number of Newton-Raphson iterations. |
In the kl.compreg() the Kullback-Leibler divergence is adopted as the objective function. In case of problematic convergence the "multinom" function by the "nnet" package is employed. This will obviously be slower. The js.compreg() uses the Jensen-Shannon divergence and the symkl.compreg() uses the symmetric Kullback-Leibler divergence. The tv.compreg() uses the Total Variation divergence. There is no actual log-likelihood for the last three regression models. The hellinger.compreg() minimizes the Hellinger distance.
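A minimal sketch of the Kullback-Leibler objective that kl.compreg() minimizes, with a multinomial-logit mean; optim() stands in for the package's Newton-Raphson solver, and `kl_obj` is an illustrative name:

```r
# KL objective: sum over cells of y * log(y / m), where m are fitted
# compositions from a multinomial-logit link (first component is the
# reference category).
kl_obj <- function(be, y, x) {
  be <- matrix(be, nrow = ncol(x))     # one coefficient column per non-reference part
  e <- cbind(1, exp(x %*% be))
  m <- e / rowSums(e)                  # fitted compositions
  sum(y * log(y / m), na.rm = TRUE)    # 0 * log(0) treated as 0
}

set.seed(2)
x <- cbind(1, rnorm(50))                       # intercept + one covariate
y <- matrix(rgamma(150, 2), ncol = 3)
y <- y / rowSums(y)                            # compositional response
fit <- optim(numeric(4), kl_obj, y = y, x = x, method = "BFGS")
```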
A list including:
runtime |
The time required by the regression. |
iters |
The number of iterations required by the Newton-Raphson in the kl.compreg function. |
loglik |
The log-likelihood. This is actually a quasi multinomial regression; the value is basically half the negative deviance. |
be |
The beta coefficients. |
covbe |
The covariance matrix of the beta coefficients, if bootstrap is chosen, i.e. if B > 1. |
est |
The fitted values of xnew if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Murteira Jose MR, and Joaquim JS Ramalho (2016). Regression analysis of multivariate fractional data. Econometric Reviews 35(4): 515-552.
Tsagris Michail (2015). A novel, divergence based, regression for compositional data. Proceedings of the 28th Panhellenic Statistics Conference, 15-18/4/2015, Athens, Greece. https://arxiv.org/pdf/1511.07600.pdf
Endres D. M. and Schindelin J. E. (2003). A new metric for probability distributions. Information Theory, IEEE Transactions on 49, 1858-1860.
Osterreicher F. and Vajda I. (2003). A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics 55, 639-653.
Alenazi A. A. (2022). f-divergence regression models for compositional data. Pakistan Journal of Statistics and Operation Research, 18(4): 867–882.
diri.reg, ols.compreg, comp.reg
library(MASS)
x <- as.vector(fgl[, 1])
y <- as.matrix(fgl[, 2:9])
y <- y / rowSums(y)
mod1 <- kl.compreg(y, x, B = 1, ncores = 1)
mod2 <- js.compreg(y, x, B = 1, ncores = 1)
The alpha-transformation
Divergence based regression for compositional data with compositional data in the covariates side, using the alpha-transformation.
kl.alfapcr(y, x, covar = NULL, a, k, xnew = NULL, B = 1, ncores = 1, tol = 1e-07, maxiters = 50)
y |
A numerical matrix with compositional data, with or without zeros. |
x |
A matrix with the predictor variables, the compositional data. Zero values are allowed. |
covar |
If you have other covariates as well, put them here. |
a |
The value of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. If alpha = 0, the isometric log-ratio transformation is applied. |
k |
A number at least equal to 1. How many principal components to use. |
xnew |
A matrix containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
B |
If B is greater than 1 bootstrap estimates of the standard error are returned. If B=1, no standard errors are returned. |
ncores |
If ncores is 2 or more parallel computing is performed. This is to be used for the case of bootstrap. If B=1, this is not taken into consideration. |
tol |
The tolerance value to terminate the Newton-Raphson procedure. |
maxiters |
The maximum number of Newton-Raphson iterations. |
The alpha-transformation is applied to the compositional data first, then the first k principal component scores are calculated and used as predictor variables in the Kullback-Leibler divergence based regression model.
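The transform-then-project pipeline can be sketched in base R (illustrative only: the package centres the transform with the Helmert sub-matrix, which this `alfa_sketch` omits):

```r
# Step 1: alpha-transform the compositional predictors.
alfa_sketch <- function(x, a) {
  z <- x^a
  z <- z / rowSums(z)            # compositional power transform
  (ncol(x) * z - 1) / a          # stay-in-simplex alpha transform
}

set.seed(3)
x <- matrix(rgamma(300, 4), ncol = 3)
x <- x / rowSums(x)

# Step 2: take the first k principal component scores as covariates.
scores <- prcomp(alfa_sketch(x, 0.5))$x[, 1:2]   # k = 2
```

The scores would then replace the raw compositions as predictors in the divergence based regression.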
A list including:
runtime |
The time required by the regression. |
iters |
The number of iterations required by the Newton-Raphson in the kl.compreg function. |
loglik |
The log-likelihood. This is actually a quasi multinomial regression; the value is basically minus half the deviance. |
be |
The beta coefficients. |
seb |
The standard error of the beta coefficients, if bootstrap is chosen, i.e. if B > 1. |
est |
The fitted values of xnew if xnew is not NULL. |
Initial code by Abdulaziz Alenazi. Modifications by Michail Tsagris.
R implementation and documentation: Abdulaziz Alenazi [email protected] and Michail Tsagris [email protected].
Alenazi A. (2019). Regression for compositional data with compositional data as predictor variables with or without zero values. Journal of Data Science, 17(1): 219-238. https://jds-online.org/journal/JDS/article/136/file/pdf
Tsagris M. (2015). Regression analysis with compositional data containing zero values. Chilean Journal of Statistics, 6(2): 47-57. http://arxiv.org/pdf/1508.01913v1.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. http://arxiv.org/pdf/1106.1451.pdf
klalfapcr.tune, tflr, glm.pcr, alfapcr.tune
library(MASS)
y <- rdiri(214, runif(4, 1, 3))
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
mod <- alfa.pcr(y = y, x = x, a = 0.7, k = 1)
mod
Divergence matrix of compositional data.
divergence(x, type = "kullback_leibler", vector = FALSE)
x |
A matrix with the compositional data. |
type |
This is either "kullback_leibler" (Kullback-Leibler, which computes the symmetric Kullback-Leibler divergence) or "jensen_shannon" (Jensen-Shannon) divergence. |
vector |
If TRUE, a vector with the divergences is returned instead of a matrix. |
The function produces the distance matrix either using the Kullback-Leibler (distance) or the Jensen-Shannon (metric) divergence. The Kullback-Leibler refers to the symmetric Kullback-Leibler divergence.
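A minimal base-R sketch of the pairwise Jensen-Shannon computation (under one common scaling convention; `js_div` is an illustrative name, not the package function):

```r
# Pairwise Jensen-Shannon divergence matrix for compositional rows:
# for each pair, compare both rows against their mid-point composition.
js_div <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    m <- 0.5 * (x[i, ] + x[j, ])
    d[i, j] <- sum(x[i, ] * log(x[i, ] / m)) + sum(x[j, ] * log(x[j, ] / m))
    d[j, i] <- d[i, j]            # the matrix is symmetric
  }
  d
}

x <- rbind(c(0.2, 0.3, 0.5), c(0.2, 0.3, 0.5), c(0.5, 0.3, 0.2))
d <- js_div(x)
```

Identical rows get divergence zero, distinct rows a positive value, so the matrix behaves like a distance matrix for downstream clustering or MDS.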
if the vector argument is FALSE a symmetric matrix with the divergences, otherwise a vector with the divergences.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Endres, D. M. and Schindelin, J. E. (2003). A new metric for probability distributions. Information Theory, IEEE Transactions on 49, 1858-1860.
Osterreicher, F. and Vajda, I. (2003). A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics 55, 639-653.
x <- as.matrix(iris[1:20, 1:4])
x <- x / rowSums(x)
divergence(x)
Empirical likelihood hypothesis testing for two mean vectors.
el.test2(y1, y2, R = 0, ncores = 1, graph = FALSE)
y1 |
A matrix containing the Euclidean data of the first group. |
y2 |
A matrix containing the Euclidean data of the second group. |
R |
If R is 0, the classical chi-square distribution is used, if R = 1, the corrected chi-square distribution (James, 1954) is used and if R = 2, the modified F distribution (Krishnamoorthy and Yanping, 2006) is used. If R is greater than 3 bootstrap calibration is performed. |
ncores |
How many cores to use. |
graph |
A boolean variable which is taken into consideration only when bootstrap calibration is performed. If TRUE, the histogram of the bootstrap test statistic values is plotted. |
The hypothesis is \(H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2\), and the two constraints imposed by EL are
\[\sum_{j=1}^{n_i} p_{ij}\left(\mathbf{x}_{ij} - \boldsymbol{\mu}\right) = \mathbf{0}, \quad i = 1, 2,\]
where \(\boldsymbol{\mu}\) is the common mean vector under the null hypothesis and the \(\boldsymbol{\lambda}_i\) are Lagrangian parameters introduced to maximize the above expression. Note that the maximization of the empirical log-likelihood is with respect to the \(\boldsymbol{\lambda}_i\). The probabilities of the \(i\)-th sample have the following form:
\[p_{ij} = \frac{1}{n_i}\left[1 + \boldsymbol{\lambda}_i^{\top}\left(\mathbf{x}_{ij} - \boldsymbol{\mu}\right)\right]^{-1}.\]
The log-likelihood ratio test statistic can be written as
\[\Lambda = -2\sum_{i=1}^{2}\sum_{j=1}^{n_i} \log\left(n_i p_{ij}\right).\]
The test is implemented by searching for the mean vector that minimizes the sum of the two one sample EL test statistics.
A list including:
test |
The empirical likelihood test statistic value. |
modif.test |
The modified test statistic, either via the chi-square or the F distribution. |
dof |
The degrees of freedom of the chi-square or the F distribution. |
pvalue |
The asymptotic or the bootstrap p-value. |
mu |
The estimated common mean vector. |
runtime |
The runtime of the bootstrap calibration. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Amaral G.J.A., Dryden I.L. and Wood A.T.A. (2007). Pivotal bootstrap methods for k-sample problems in directional statistics and shape analysis. Journal of the American Statistical Association, 102(478): 695–707.
Owen A. B. (2001). Empirical likelihood. Chapman and Hall/CRC Press.
Owen A.B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2): 237–249.
Preston S.P. and Wood A.T.A. (2010). Two-Sample Bootstrap Hypothesis Tests for Three-Dimensional Labelled Landmark Data. Scandinavian Journal of Statistics, 37(4): 568–587.
eel.test2, maovjames, hotel2T2, james
el.test2( y1 = as.matrix(iris[1:25, 1:4]), y2 = as.matrix(iris[26:50, 1:4]), R = 0 )
el.test2( y1 = as.matrix(iris[1:25, 1:4]), y2 = as.matrix(iris[26:50, 1:4]), R = 1 )
el.test2( y1 = as.matrix(iris[1:25, 1:4]), y2 = as.matrix(iris[26:50, 1:4]), R = 2 )
The alpha-transformation
Energy test of equality of distributions using the alpha-transformation.
aeqdist.etest(x, sizes, a = 1, R = 999)
x |
A matrix with the compositional data with all groups stacked one under the other. |
sizes |
A numeric vector with the sample sizes of the groups. |
a |
The value of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. If alpha = 0, the isometric log-ratio transformation is applied. |
R |
The number of permutations to apply in order to compute the approximate p-value. |
The α-transformation is applied to each composition and then the
energy test of equality of distributions is performed, either for each value of
α supplied or for the single value of α.
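The two-sample energy statistic underlying the test can be sketched in a few lines of base R (a hypothetical re-implementation for illustration, not the package's optimised code):

```r
# Two-sample energy statistic: n*m/(n+m) * (2*mean(d_xy) - mean(d_xx) - mean(d_yy)),
# computed on the (possibly alpha-transformed) data.
energy_stat <- function(x, y) {
  n <- nrow(x); m <- nrow(y)
  d <- as.matrix(dist(rbind(x, y)))        # all pairwise Euclidean distances
  dxy <- mean(d[1:n, (n + 1):(n + m)])
  dxx <- mean(d[1:n, 1:n])                 # zero diagonal included, as in e-statistics
  dyy <- mean(d[(n + 1):(n + m), (n + 1):(n + m)])
  n * m / (n + m) * (2 * dxy - dxx - dyy)
}
set.seed(1)
x <- matrix(rnorm(60), ncol = 3)
y <- matrix(rnorm(60, mean = 1), ncol = 3)
es <- energy_stat(x, y)
```

A permutation p-value is then obtained by recomputing the statistic after shuffling the stacked rows R times.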
A numerical value or a numerical vector, depending on the number of values
of α, with the approximate p-value(s) of the energy test.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension. InterStat, November (5).
Szekely, G. J. (2000) Technical Report 03-05: E-statistics: Energy of Statistical Samples. Department of Mathematics and Statistics, Bowling Green State University.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
acor, acor.tune, alfa, alfa.profile
y <- rdiri(50, c(3, 4, 5))
x <- rdiri(60, c(3, 4, 5))
aeqdist.etest( rbind(x, y), c(dim(x)[1], dim(y)[1]), a = c(-1, 0, 1) )
Estimating location and scatter parameters for compositional data in a robust and a non-robust way.
comp.den(x, type = "alr", dist = "normal", tol = 1e-07)
x |
A matrix containing compositional data. No zero values are allowed. |
type |
A character variable indicating the transformation to be used: either "alr" or "ilr", corresponding to the additive or the isometric log-ratio transformation respectively. |
dist |
Takes the values "normal", "t", "skewnorm", "rob" and "spatial". The first three options correspond to the parameters of the normal, t and skew-normal distribution respectively. If it is set to "rob" the MCD estimates are computed, and if set to "spatial" the spatial median and the spatial sign covariance matrix are computed. |
tol |
A tolerance level to terminate the process of finding the spatial median when dist = "spatial". This is set to 1e-07 by default. |
This function calculates robust and non-robust estimates of location and scatter.
A list including mainly the mean vector and the covariance matrix. Other parameters are also returned, depending on the value of the argument "dist".
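As an illustration of the dist = "spatial" option, the spatial median can be computed with a Weiszfeld-type fixed-point iteration; this is a minimal base-R sketch, not necessarily the algorithm used internally:

```r
# Spatial median: the point minimising the sum of Euclidean distances
# to the observations, found by a Weiszfeld-type iteration.
spatmed <- function(x, tol = 1e-07) {
  m <- colMeans(x)                                  # starting value
  repeat {
    d <- sqrt(rowSums(sweep(x, 2, m)^2))
    d[d < tol] <- tol                               # guard against division by zero
    m_new <- colSums(x / d) / sum(1 / d)
    if (sum(abs(m_new - m)) < tol) return(m_new)
    m <- m_new
  }
}
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
sm <- spatmed(x)
```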
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
P. J. Rousseeuw and K. van Driessen (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212-223.
Mardia K.V., Kent J.T., and Bibby J.M. (1979). Multivariate analysis. Academic press.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Kärkkäinen T. and Äyrämö S. (2005). On computation of spatial median for robust data mining. In Evolutionary and Deterministic Methods for Design, Optimization and Control with Applications to Industrial and Societal Problems (EUROGEN 2005).
Dürre A., Vogel D. and Tyler D.E. (2014). The spatial sign covariance matrix with unknown location. Journal of Multivariate Analysis, 130: 107-117.
J. T. Kent, D. E. Tyler and Y. Vardi (1994) A curious likelihood identity for the multivariate t-distribution. Communications in Statistics-Simulation and Computation 23, 441-453.
Azzalini A. and Dalla Valle A. (1996). The multivariate skew-normal distribution. Biometrika 83(4): 715-726.
library(MASS)
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
comp.den(x)
comp.den(x, type = "alr", dist = "t")
comp.den(x, type = "alr", dist = "spatial")
Estimation of the probability left outside the simplex when using the α-transformation.
probout(mu, su, a)
mu |
The mean vector. |
su |
The covariance matrix. |
a |
The value of the power transformation α. |
When applying the α-transformation based on a multivariate normal distribution there might be
probability left outside the simplex, as the space of this transformation is a subset of the
Euclidean space. The function estimates the missing probability via Monte Carlo simulation, using
40 million generated vectors.
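The Monte Carlo idea can be sketched in base R: draw multivariate normal vectors and estimate the proportion falling outside a feasibility region (here a hypothetical half-plane constraint stands in for the image of the simplex):

```r
# Estimate the probability mass outside a feasible region by simulation.
set.seed(1)
mu <- c(1, -1)
S <- matrix(c(1, 0.3, 0.3, 1), 2)
z <- matrix(rnorm(2e5), ncol = 2) %*% chol(S)      # correlated normal draws
z <- sweep(z, 2, mu, "+")
# hypothetical feasibility condition standing in for "inside the simplex image"
outside <- apply(z, 1, function(v) any(v < -2))
p_out <- mean(outside)
```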
The estimated probability left outside the simplex.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. and Stewart C. (2020). A folded model for compositional data analysis. Australian and New Zealand Journal of Statistics, 62(2): 249-277. https://arxiv.org/pdf/1802.07330.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
alfa, alpha.mle, a.est, rfolded
s <- c(0.1490676523, -0.4580818209, 0.0020395316, -0.0047446076,
       -0.4580818209, 1.5227259250, 0.0002596411, 0.0074836251,
       0.0020395316, 0.0002596411, 0.0365384838, -0.0471448849,
       -0.0047446076, 0.0074836251, -0.0471448849, 0.0611442781)
s <- matrix(s, ncol = 4)
m <- c(1.715, 0.914, 0.115, 0.167)
probout(m, s, 0.5)
α in the folded model
Estimation of the value of α in the folded model.
a.est(x)
x |
A matrix with the compositional data. No zero values are allowed. |
This is a function for choosing or estimating the value of α
in the folded model (Tsagris and Stewart, 2020).
A list including:
runtime |
The runtime of the algorithm. |
best |
The estimated optimal value of α. |
loglik |
The maximised log-likelihood of the folded model. |
p |
The estimated probability inside the simplex of the folded model. |
mu |
The estimated mean vector of the folded model. |
su |
The estimated covariance matrix of the folded model. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. and Stewart C. (2022). A Review of Flexible Transformations for Modeling Compositional Data. In Advances and Innovations in Statistics and Data Science, pp. 225–234. https://link.springer.com/chapter/10.1007/978-3-031-08329-7_10
Tsagris M. and Stewart C. (2020). A folded model for compositional data analysis. Australian and New Zealand Journal of Statistics, 62(2): 249-277. https://arxiv.org/pdf/1802.07330.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
alfa.profile, alfa, alfainv, alpha.mle
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
alfa.tune(x)
a.est(x)
α via the alfa profile log-likelihood
Estimation of the value of α via the alfa profile log-likelihood.
alfa.profile(x, a = seq(-1, 1, by = 0.01))
x |
A matrix with the compositional data. Zero values are not allowed. |
a |
A grid of values of α. |
For every value of α the normal log-likelihood (see the reference) is computed. At the end, the plot of the profile log-likelihood values is constructed.
A list including:
res |
The chosen value of α. |
ci |
An asymptotic 95% confidence interval computed from the log-likelihood ratio test. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
alfa.tune(x)
alfa.profile(x)
Exponential empirical likelihood hypothesis testing for two mean vectors.
eel.test2(y1, y2, tol = 1e-07, R = 0, graph = FALSE)
y1 |
A matrix containing the Euclidean data of the first group. |
y2 |
A matrix containing the Euclidean data of the second group. |
tol |
The tolerance level used to terminate the Newton-Raphson algorithm. |
R |
If R is 0, the classical chi-square distribution is used, if R = 1, the corrected chi-square distribution (James, 1954) is used and if R = 2, the modified F distribution (Krishnamoorthy and Yanping, 2006) is used. If R is greater than 3 bootstrap calibration is performed. |
graph |
A boolean variable which is taken into consideration only when bootstrap calibration is performed. If TRUE the histogram of the bootstrap test statistic values is plotted. |
Exponential empirical likelihood, or exponential tilting, was first introduced by Efron (1981) as a way to perform a "tilted" version of the bootstrap for the one-sample mean hypothesis test. Similarly to the empirical likelihood, positive weights p_i, which sum to one, are allocated to the observations, such that the weighted sample mean is equal to some population mean μ, under the null hypothesis H0. Under the alternative hypothesis H1 the weights are equal to 1/n, where n is the sample size. Following Efron (1981), the choice of the p_i will minimise the Kullback-Leibler distance from H0 to H1, subject to the weighted-mean constraint. The probabilities take the exponential-tilting form p_i = exp(λ'x_i) / Σ_j exp(λ'x_j), and the constraint becomes Σ_i p_i (x_i - μ) = 0. Similarly to empirical likelihood, a numerical search over λ is required.
We can derive the asymptotic form of the test statistic in the two-sample means case, but in a simpler form, generalising the approach of Jing and Robinson (1997) to the multivariate case. Three constraints are imposed: a weighted-mean constraint for each sample, with a common mean, and, similarly to EL, the requirement that the sum of a linear combination of the λs is zero. Equating the first two constraints and substituting the third allows the first two constraints to be rewritten in terms of a single λ.
This trick allows us to avoid the estimation of the common mean. It is not possible, though, to do this in the empirical likelihood method. Instead of minimising the sum of the one-sample test statistics from the common mean, we can define the probabilities by searching for the λ which makes the last equation hold true. The third constraint is convenient, but Jing and Robinson (1997) mention that, even though it is simple, it does not lead to second-order accurate confidence intervals unless the two sample sizes are equal. Asymptotically, the test statistic follows a chi-square distribution, with degrees of freedom equal to the dimensionality of the data,
under the null hypothesis.
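The tilting step can be illustrated in the univariate one-sample case: search for the λ whose tilted weights satisfy the mean constraint (a toy sketch, not the multivariate Newton-Raphson used by eel.test2):

```r
# Find lambda so that p_i proportional to exp(lambda * x_i) gives weighted mean mu.
set.seed(1)
x <- rnorm(50, mean = 0.3)
mu <- 0
tilt_gap <- function(lam) {
  p <- exp(lam * x); p <- p / sum(p)
  sum(p * x) - mu                 # zero when the constraint holds
}
lam <- uniroot(tilt_gap, c(-20, 20), tol = 1e-10)$root
p <- exp(lam * x); p <- p / sum(p)
```

The weights p now sum to one and satisfy the mean constraint; the test statistic is then built from these tilted weights.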
A list including:
test |
The empirical likelihood test statistic value. |
modif.test |
The modified test statistic, either via the chi-square or the F distribution. |
dof |
The degrees of freedom of the chi-square or the F distribution. |
pvalue |
The asymptotic or the bootstrap p-value. |
mu |
The estimated common mean vector. |
runtime |
The runtime of the bootstrap calibration. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Efron B. (1981) Nonparametric standard errors and confidence intervals. Canadian Journal of Statistics, 9(2): 139–158.
Jing B.Y. and Wood A.T.A. (1996). Exponential empirical likelihood is not Bartlett correctable. Annals of Statistics, 24(1): 365–369.
Jing B.Y. and Robinson J. (1997). Two-sample nonparametric tilting method. Australian Journal of Statistics, 39(1): 25–34.
Owen A.B. (2001). Empirical likelihood. Chapman and Hall/CRC Press.
Preston S.P. and Wood A.T.A. (2010). Two-Sample Bootstrap Hypothesis Tests for Three-Dimensional Labelled Landmark Data. Scandinavian Journal of Statistics 37(4): 568–587.
Tsagris M., Preston S. and Wood A.T.A. (2017). Nonparametric hypothesis testing for equality of means on the simplex. Journal of Statistical Computation and Simulation, 87(2): 406–422.
el.test2, maovjames, hotel2T2,
james
y1 <- as.matrix(iris[1:25, 1:4])
y2 <- as.matrix(iris[26:50, 1:4])
eel.test2(y1, y2)
Fast estimation of the value of α.
alfa.tune(x, B = 1, ncores = 1)
x |
A matrix with the compositional data. No zero values are allowed. |
B |
Set this to 1 if no bootstrap-based confidence intervals are to be returned, and to a value greater than 1 otherwise. |
ncores |
If ncores is greater than 1 parallel computing is performed. It is advisable to use it if you have many observations and/or many variables, otherwise it will slow down the process. |
This is a faster function than alfa.profile
for choosing the value of α.
A vector with the best α, the maximised log-likelihood and the log-likelihood at α = 0, when B = 1 (no bootstrap). If B > 1 a list including:
param |
The best α and the value of the log-likelihood, along with the 95% bootstrap-based confidence intervals. |
message |
A message with some information about the histogram. |
runtime |
The time (in seconds) of the process. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
library(MASS)
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
alfa.tune(x)
alfa.profile(x)
Gaussian mixture models for compositional data.
mix.compnorm(x, g, model, type = "alr", veo = FALSE)
x |
A matrix with the compositional data. |
g |
How many clusters to create. |
model |
The type of model to be used.
|
type |
The type of transformation to be used: either the additive log-ratio ("alr"), the isometric log-ratio ("ilr") or the pivot coordinate ("pivot") transformation. |
veo |
Stands for "variables exceed observations". If TRUE and the number of variables in the model exceeds the number of observations, the model is still fitted. |
A log-ratio transformation is applied and then a Gaussian mixture model is constructed.
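For instance, the additive log-ratio step can be written in base R; the mixture model is then fitted to the transformed data y (a sketch, assuming the last component is the divisor):

```r
# Additive log-ratio (alr) transformation: log of each part over the last part.
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)                  # make the rows compositional
D <- ncol(x)
y <- log(x[, -D] / x[, D])           # n x (D - 1) matrix in Euclidean space
```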
A list including:
mu |
A matrix where each row corresponds to the mean vector of each cluster. |
su |
An array containing the covariance matrix of each cluster. |
prob |
The estimated mixing probabilities. |
est |
The estimated cluster membership values. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ryan P. Browne, Aisha ElSherbiny and Paul D. McNicholas (2015). R package mixture: Mixture Models for Clustering and Classification.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
bic.mixcompnorm, rmixcomp, mix.compnorm.contour, alfa.mix.norm,
alfa.knn,
alfa.rda, comp.nb
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
mod1 <- mix.compnorm(x, 3, model = "EII")
mod2 <- mix.compnorm(x, 4, model = "VII")
α-transformation
Gaussian mixture models for compositional data using the α-transformation.
alfa.mix.norm(x, g, a, model, veo = FALSE)
x |
A matrix with the compositional data. |
g |
How many clusters to create. |
a |
The value of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. |
model |
The type of model to be used.
|
veo |
Stands for "variables exceed observations". If TRUE and the number of variables in the model exceeds the number of observations, the model is still fitted. |
A log-ratio transformation is applied and then a Gaussian mixture model is constructed.
A list including:
mu |
A matrix where each row corresponds to the mean vector of each cluster. |
su |
An array containing the covariance matrix of each cluster. |
prob |
The estimated mixing probabilities. |
est |
The estimated cluster membership values. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ryan P. Browne, Aisha ElSherbiny and Paul D. McNicholas (2015). R package mixture: Mixture Models for Clustering and Classification.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
bic.alfamixnorm, bic.mixcompnorm, rmixcomp, mix.compnorm.contour, mix.compnorm,
alfa, alfa.knn, alfa.rda, comp.nb
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
mod1 <- alfa.mix.norm(x, 3, 0.4, model = "EII")
mod2 <- alfa.mix.norm(x, 4, 0.7, model = "VII")
Generalised Dirichlet random values simulation.
rgendiri(n, a, b)
n |
The sample size, a numerical value. |
a |
A numerical vector with the shape parameter values of the Gamma distribution. |
b |
A numerical vector with the scale parameter values of the Gamma distribution. |
The algorithm is straightforward: for each vector, independent Gamma values are generated and
then divided by their total sum. The difference from rdiri
is that here the Gamma distributed variables are not equally scaled.
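The generation scheme described above is short enough to sketch in base R (a hypothetical re-implementation for illustration):

```r
# Generalised Dirichlet draws: independent Gamma variates with their own
# shape and scale parameters, normalised to sum to one.
rgendiri_sketch <- function(n, a, b) {
  g <- sapply(seq_along(a), function(j) rgamma(n, shape = a[j], scale = b[j]))
  g / rowSums(g)
}
set.seed(1)
x <- rgendiri_sketch(100, a = c(1, 2, 3), b = c(2, 3, 4))
```

Each row of x lies on the simplex; with equal scales b the sketch reduces to an ordinary Dirichlet draw.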
A matrix with the simulated data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
rdiri, diri.est, diri.nr, diri.contour
a <- c(1, 2, 3)
b <- c(2, 3, 4)
x <- rgendiri(100, a, b)
Random folds for use in a cross validation are generated. There is the option for stratified splitting as well.
makefolds(ina, nfolds = 10, stratified = TRUE, seed = NULL)
ina |
A variable indicating the groupings. |
nfolds |
The number of folds to produce. |
stratified |
A boolean variable specifying whether stratified random (TRUE) or simple random (FALSE) sampling is to be used when producing the folds. |
seed |
You can specify your own seed number here or leave it NULL. |
I was inspired by a command in the package TunePareto in order to do the stratified version.
A list with nfolds elements, where each element is a fold containing the indices of the data.
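The stratified splitting can be sketched in base R: shuffle the indices within each group and deal them round-robin to the folds (a hypothetical re-implementation, not the package's exact code):

```r
# Stratified folds: every fold receives a near-equal share of each group.
make_strat_folds <- function(ina, nfolds = 10, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  folds <- vector("list", nfolds)
  for (idx in split(seq_along(ina), ina)) {   # indices of each group
    idx <- sample(idx)                        # shuffle within the group
    f <- rep_len(seq_len(nfolds), length(idx))
    for (j in seq_len(nfolds)) folds[[j]] <- c(folds[[j]], idx[f == j])
  }
  folds
}
fl <- make_strat_folds(iris[, 5], nfolds = 5, seed = 1)
```

With 50 observations per species and 5 folds, table(iris[fl[[1]], 5]) shows 10 observations from each species.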
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
a <- makefolds(iris[, 5], nfolds = 5, stratified = TRUE)
table(iris[a[[1]], 5]) ## 10 values from each group
Greenacre's power transformation.
green(x, theta)
x |
A matrix with the compositional data. |
theta |
The value of the power transformation; it has to be between -1 and 1. If zero values are present it has to
be greater than 0. |
Greenacre's transformation is applied to the compositional data.
A matrix with the power transformed data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Greenacre, M. (2009). Power transformations in correspondence analysis. Computational Statistics & Data Analysis, 53(8): 3107-3116. http://www.econ.upf.edu/~michael/work/PowerCA.pdf
library(MASS)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
y1 <- green(x, 0.1)
y2 <- green(x, 0.2)
rbind( colMeans(y1), colMeans(y2) )
Helper Frechet mean for compositional data.
frechet2(x, di, a, k)
x |
A matrix with the compositional data. |
di |
A matrix with indices as produced by the function "dista" of the package "Rfast" or the function "nn" of the package "Rnanoflann". Better see the details section. |
a |
The value of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. |
k |
The number of nearest neighbours used for the computation of the Frechet means. |
The power transformation is applied to the compositional data and the mean vector is calculated. The inverse of the power transformation applied to that mean vector gives the Frechet mean.
What this helper function does is speed up the Frechet mean when used in the α-k-NN regression. The
α-k-NN regression computes the Frechet mean of the k nearest neighbours for a value of α,
and this function does exactly that. Suppose you want to predict the compositional value of some new predictors. For each predictor value you must use the Frechet mean computed at various nearest neighbours. This function performs these computations in a fast way. It is not the fastest way, yet it is a pretty fast way. This function is called inside the function aknn.reg.
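The computation for a single set of neighbours reduces to three steps, sketched here with the unrotated version of the power transformation (an assumption for illustration; the package works with the Helmert-rotated α-transformation):

```r
# Frechet mean: power-transform, average, invert the transformation.
frechet_sketch <- function(x, a) {
  D <- ncol(x)
  u <- x^a / rowSums(x^a)         # power-transformed compositions
  z <- (D * u - 1) / a            # alpha-transformation, no Helmert rotation
  m <- colMeans(z)                # average in the transformed space
  w <- (a * m + 1)^(1 / a)        # invert the power transformation
  w / sum(w)                      # back on the simplex
}
set.seed(1)
x <- matrix(rexp(40), ncol = 4)
x <- x / rowSums(x)
fm <- frechet_sketch(x, a = 0.5)
```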
A list where each element contains a matrix. Each matrix contains the Frechet means computed at various nearest neighbours.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
library(MASS)
library(Rfast)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
xnew <- x[1:10, ]
x <- x[-c(1:10), ]
k <- 2:5
di <- Rfast::dista( xnew, x, k = max(k), index = TRUE, square = TRUE )
est <- frechet2(x, di, 0.2, k)
Helper functions for the Kullback-Leibler regression.
kl.compreg2(y, x, con = TRUE, xnew = NULL, tol = 1e-07, maxiters = 50)
klcompreg.boot(y, x, der, der2, id, b1, n, p, d, tol = 1e-07, maxiters = 50)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. For the klcompreg.boot the first column is removed. |
x |
The predictor variable(s), they can be either continuous or categorical or both. In the klcompreg.boot this is the design matrix. |
con |
If this is TRUE (default) then the constant term is estimated, otherwise the model includes no constant term. |
xnew |
If you have new data use it, otherwise leave it NULL. |
tol |
The tolerance value to terminate the Newton-Raphson procedure. |
maxiters |
The maximum number of Newton-Raphson iterations. |
der |
A vector to put the first derivative in. |
der2 |
An empty matrix to put the second derivatives there, the Hessian matrix will be put here. |
id |
A help vector with indices. |
b1 |
The matrix with the initial estimated coefficients. |
n |
The sample size. |
p |
The number of columns of the design matrix. |
d |
The dimensionality of the simplex, that is the number of columns of the compositional data minus 1. |
These are help functions for the kl.compreg
function. They are not to be called directly by the user.
For kl.compreg2 a list including:
iters |
The number of iterations required by the Newton-Raphson. |
loglik |
The loglikelihood. |
be |
The beta coefficients. |
est |
The fitted or the predicted values (if xnew is not NULL). |
For klcompreg.boot a list including:
loglik |
The loglikelihood. |
be |
The beta coefficients. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Murteira J.M.R. and Ramalho J.J.S. (2016). Regression analysis of multivariate fractional data. Econometric Reviews, 35(4): 515-552.
diri.reg, js.compreg, ols.compreg, comp.reg
library(MASS)
x <- as.vector(fgl[, 1])
y <- as.matrix(fgl[, 2:9])
y <- y / rowSums(y)
mod1 <- kl.compreg(y, x, B = 1, ncores = 1)
mod2 <- js.compreg(y, x, B = 1, ncores = 1)
Hotelling's test for testing the equality of two Euclidean population mean vectors.
hotel2T2(x1, x2, a = 0.05, R = 999, graph = FALSE)
x1 |
A matrix containing the Euclidean data of the first group. |
x2 |
A matrix containing the Euclidean data of the second group. |
a |
The significance level, set to 0.05 by default. |
R |
If R is 1 no bootstrap calibration is performed and the classical p-value via the F distribution is returned. If R is greater than 1, the bootstrap p-value is returned. |
graph |
A boolean variable which is taken into consideration only when bootstrap calibration is performed. If TRUE the histogram of the bootstrap test statistic values is plotted. |
The first case scenario is when we assume equality of the two covariance matrices. This is called the two-sample Hotelling's T^2 test (Mardia, Kent and Bibby, 1979, pg. 131-140 and Everitt, 2005, pg. 139). The test statistic is defined as
T^2 = [(n1 * n2) / (n1 + n2)] * (xbar1 - xbar2)' S^(-1) (xbar1 - xbar2),
where S is the pooled covariance matrix calculated under the assumption of equal covariance matrices,
S = [(n1 - 1) * S1 + (n2 - 1) * S2] / (n1 + n2 - 2).
Under H0, the statistic F given by
F = [(n1 + n2 - p - 1) / ((n1 + n2 - 2) * p)] * T^2
follows the F distribution with p and n1 + n2 - p - 1
degrees of freedom. Similar to the one-sample test, an extra argument (R) indicates whether bootstrap calibration should be used or not. If R = 1, the asymptotic theory applies; if R > 1, the bootstrap p-value is returned and the number of re-samples is equal to R. The estimate of the common mean used in the bootstrap to transform the data under the null hypothesis is the mean vector of the combined sample, of all the observations.
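The statistic and its F calibration are easy to verify in base R on the same iris subsets used in the examples:

```r
# Two-sample Hotelling T^2 with the pooled covariance matrix.
x1 <- as.matrix(iris[1:25, 1:4]); x2 <- as.matrix(iris[26:50, 1:4])
n1 <- nrow(x1); n2 <- nrow(x2); p <- ncol(x1)
S <- ((n1 - 1) * cov(x1) + (n2 - 1) * cov(x2)) / (n1 + n2 - 2)
d <- colMeans(x1) - colMeans(x2)
T2 <- n1 * n2 / (n1 + n2) * drop(t(d) %*% solve(S, d))
Fstat <- (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2
pval <- 1 - pf(Fstat, p, n1 + n2 - p - 1)
```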
The built-in command manova
does exactly the same thing; try it and you will see that the asymptotic test agrees. In addition, that command allows for mean vector hypothesis testing for more than two groups. I noticed this command after I had written my function; nevertheless, as mentioned in the introduction, this document has an educational character as well.
A list including:
mesoi |
The two mean vectors. |
info |
The test statistic, the p-value, the critical value and the degrees of freedom of the F distribution (numerator and denominator). This is given if no bootstrap calibration is employed. |
pvalue |
The bootstrap p-value if bootstrap is employed. |
note |
A message informing the user that bootstrap calibration has been employed. |
runtime |
The runtime of the bootstrap calibration. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Everitt B. (2005). An R and S-Plus Companion to Multivariate Analysis. Springer.
Mardia K.V., Kent J.T. and Bibby J.M. (1979). Multivariate Analysis. London: Academic Press.
Tsagris M., Preston S. and Wood A.T.A. (2017). Nonparametric hypothesis testing for equality of means on the simplex. Journal of Statistical Computation and Simulation, 87(2): 406–422.
hotel2T2( as.matrix(iris[1:25, 1:4]), as.matrix(iris[26:50, 1:4]) )
hotel2T2( as.matrix(iris[1:25, 1:4]), as.matrix(iris[26:50, 1:4]), R = 1 )
Hypothesis testing for two or more compositional mean vectors.
comp.test(x, ina, test = "james", R = 0, ncores = 1, graph = FALSE)
x |
A matrix containing compositional data. |
ina |
A numerical or factor variable indicating the groups of the data. |
test |
This can take the values of "james" for James' test, "hotel" for Hotelling's test, "maov" for multivariate analysis of variance assuming equality of the covariance matrices, "maovjames" for multivariate analysis of variance without assuming equality of the covariance matrices. "el" for empirical likelihood or "eel" for exponential empirical likelihood. |
R |
This depends upon the value of the argument "test". If the test is "maov" or "maovjames", R is not taken into consideration. If test is "hotel", then R denotes the number of bootstrap resamples. If test is "james", then R can be 1 (chi-square distribution), 2 (F distribution), or more for bootstrap calibration. If test is "el", then R can be 0 (chi-square), 1 (corrected chi-square), 2 (F distribution) or more for bootstrap calibration. See the help page of each test for more information. |
ncores |
How many cores to use. This is taken into consideration only if test is "el" and R is more than 2. |
graph |
A boolean variable which is taken into consideration only when bootstrap calibration is performed. If TRUE the histogram of the bootstrap test statistic values is plotted. This is taken into account only when R is greater than 2. |
The idea is to apply the α-transformation to the compositional data and then use a test to compare their mean vectors.
See the help page of each test for more information. The function is visible so you can see exactly what is going on.
A list including:
result |
The outcome of each test. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Tsagris M., Preston S. and Wood A.T.A. (2017). Nonparametric hypothesis testing for equality of means on the simplex. Journal of Statistical Computation and Simulation, 87(2): 406-422.
G.S. James (1954). Tests of Linear Hypotheses in Univariate and Multivariate Analysis when the Ratios of the Population Variances are Unknown. Biometrika, 41(1/2): 19-43.
Krishnamoorthy K. and Yanping Xia (2006). On Selecting Tests for Equality of Two Normal Mean Vectors. Multivariate Behavioral Research 41(4): 533-548.
Owen A. B. (2001). Empirical likelihood. Chapman and Hall/CRC Press.
Owen A.B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75(2): 237-249.
Amaral G.J.A., Dryden I.L. and Wood A.T.A. (2007). Pivotal bootstrap methods for k-sample problems in directional statistics and shape analysis. Journal of the American Statistical Association 102(478): 695-707.
Preston S.P. and Wood A.T.A. (2010). Two-Sample Bootstrap Hypothesis Tests for Three-Dimensional Labelled Landmark Data. Scandinavian Journal of Statistics 37(4): 568-587.
Jing Bing-Yi and Andrew TA Wood (1996). Exponential empirical likelihood is not Bartlett correctable. Annals of Statistics 24(1): 365-369.
ina <- rep(1:2, each = 50)
x <- as.matrix(iris[1:100, 1:4])
x <- x / rowSums(x)
comp.test( x, ina, test = "james" )
comp.test( x, ina, test = "hotel" )
comp.test( x, ina, test = "el" )
comp.test( x, ina, test = "eel" )
ICE plot for projection pursuit regression with compositional predictor variables.
ice.pprcomp(model, x, k = 1, frac = 0.1, type = "log")
model |
The ppr model, the outcome of the pprcomp function. |
x |
A matrix with the compositional data. No zero values are allowed. |
k |
Which variable to select. |
frac |
Fraction of observations to use. The default value is 0.1. |
type |
Either "alr" or "log" corresponding to the additive log-ratio transformation or the simple logarithm applied to the compositional data. |
This function implements the Individual Conditional Expectation plots of Goldstein et al. (2015). See the references for more details.
A graph with several curves. The horizontal axis contains the selected variable, whereas the vertical axis contains the centered predicted values. The black curves are the effects for each observation and the blue line is their average effect.
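The ICE construction described above can be sketched in a few lines of base R. This is only an illustration of the idea with a stand-in lm() model, not the pprcomp model the function actually uses:

```r
# Sketch of ICE curves (Goldstein et al., 2015) with a stand-in lm() model.
x <- as.matrix( iris[, 2:4] )
y <- iris[, 1]
mod <- lm(y ~ x)
k <- 1                                   # the selected variable
grid <- seq( min(x[, k]), max(x[, k]), length = 20 )
n <- nrow(x)
curves <- matrix(NA, n, length(grid))
for (i in 1:n) {
  xi <- matrix( x[i, ], length(grid), ncol(x), byrow = TRUE )
  xi[, k] <- grid                        # vary only the selected variable
  curves[i, ] <- cbind(1, xi) %*% coef(mod)
}
curves <- curves - curves[, 1]           # centre each curve (the black curves)
average <- colMeans(curves)              # the blue average effect
```

Plotting t(curves) with matplot() and adding the column means reproduces the black per-observation curves and the blue average line described above.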
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
https://christophm.github.io/interpretable-ml-book/ice.html
Goldstein, A., Kapelner, A., Bleich, J. and Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics 24(1): 44-65.
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76, 817-823. doi: 10.2307/2287576.
pprcomp, pprcomp.tune, ice.kernreg, alfa.pcr, lc.reg, comp.ppr
x <- as.matrix( iris[, 2:4] )
x <- x / rowSums(x)
y <- iris[, 1]
model <- pprcomp(y, x)
ice <- ice.pprcomp(model, x, k = 1)
regression
ICE plot for the α-k-NN regression.
ice.aknnreg(y, x, a, k, apostasi = "euclidean", rann = FALSE, ind = 1, frac = 0.2, qpos = 0.9)
y |
A numerical vector with the response values. |
x |
A numerical matrix with the predictor variables. |
a |
The value of α, the power parameter of the α-transformation. |
k |
The number of nearest neighbours to consider. |
apostasi |
The type of distance to use, either "euclidean" or "manhattan". |
rann |
If you have large-scale datasets and want a faster k-NN search, you can use the kd-trees implemented in the R package "Rnanoflann". In that case you must set this argument to TRUE. Note, however, that the only distance available then is the Euclidean. |
ind |
Which variable to select? |
frac |
Fraction of observations to use. The default value is 0.2. |
qpos |
A number between 0.8 and 1, used to position the legend of the figure. Since the code is open, you may tweak this argument until the legend is placed as you prefer. |
This function implements the Individual Conditional Expectation (ICE) plots of Goldstein et al. (2015). See the references for more details.
A graph with several curves, one for each component. The horizontal axis contains the selected variable, whereas the vertical axis contains the locally smoothed predicted compositional lines.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
https://christophm.github.io/interpretable-ml-book/ice.html
Goldstein, A., Kapelner, A., Bleich, J. and Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics 24(1): 44-65.
y <- as.matrix( iris[, 2:4] )
x <- iris[, 1]
ice <- ice.aknnreg(y, x, a = 0.6, k = 5, ind = 1)
α-kernel regression
ICE plot for the α-kernel regression.
ice.akernreg(y, x, a, h, type = "gauss", ind = 1, frac = 0.1, qpos = 0.9)
y |
A numerical vector with the response values. |
x |
A numerical matrix with the predictor variables. |
a |
The value of α, the power parameter of the α-transformation. |
h |
The bandwidth value to consider. |
type |
The type of kernel to use, "gauss" or "laplace". |
ind |
Which variable to select? |
frac |
Fraction of observations to use. The default value is 0.1. |
qpos |
A number between 0.8 and 1, used to position the legend of the figure. Since the code is open, you may tweak this argument until the legend is placed as you prefer. |
This function implements the Individual Conditional Expectation (ICE) plots of Goldstein et al. (2015). See the references for more details.
A graph with several curves, one for each component. The horizontal axis contains the selected variable, whereas the vertical axis contains the locally smoothed predicted compositional lines.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
https://christophm.github.io/interpretable-ml-book/ice.html
Goldstein, A., Kapelner, A., Bleich, J. and Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics 24(1): 44-65.
y <- as.matrix( iris[, 2:4] )
x <- iris[, 1]
ice <- ice.akernreg(y, x, a = 0.6, h = 0.1, ind = 1)
ICE plot for univariate kernel regression.
ice.kernreg(y, x, h, type = "gauss", k = 1, frac = 0.1)
y |
A numerical vector with the response values. |
x |
A numerical matrix with the predictor variables. |
h |
The bandwidth value to consider. |
type |
The type of kernel to use, "gauss" or "laplace". |
k |
Which variable to select? |
frac |
Fraction of observations to use. The default value is 0.1. |
This function implements the Individual Conditional Expectation (ICE) plots of Goldstein et al. (2015). See the references for more details.
A graph with several curves. The horizontal axis contains the selected variable, whereas the vertical axis contains the centered predicted values. The black curves are the effects for each observation and the blue line is their average effect.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
https://christophm.github.io/interpretable-ml-book/ice.html
Goldstein, A., Kapelner, A., Bleich, J. and Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics 24(1): 44-65.
ice.pprcomp, kernreg.tune, alfa.pcr, lc.reg
x <- as.matrix( iris[, 2:4] )
y <- iris[, 1]
ice <- ice.kernreg(y, x, h = 0.1, k = 1)
α-transformation
The inverse of the α-transformation.
alfainv(x, a, h = TRUE)
x |
A matrix with Euclidean data. However, they must lie within the feasible, acceptable space. See references for more information. |
a |
The value of the power transformation; it has to be between -1 and 1.
If zero values are present it has to be greater than 0. If α = 0, the isometric log-ratio transformation is applied. |
h |
If h = TRUE, the multiplication with the Helmert sub-matrix will take place. It is set to TRUE by default. |
The inverse of the α-transformation is applied to the data. If the data lie outside the α-space, NAs will be returned for some values.
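A quick base-R roundtrip illustrates the inversion for the h = FALSE case, assuming the power-transformation form u = (D x^α / Σ x^α − 1)/α (the Helmert multiplication is omitted here):

```r
# Roundtrip check: alpha-transform a composition (without the Helmert
# sub-matrix) and invert it again; xinv should recover x.
D <- 4
a <- 0.5
x <- c(0.1, 0.2, 0.3, 0.4)                 # a composition
u <- ( D * x^a / sum(x^a) - 1 ) / a        # the alpha-transformation (h = FALSE)
w <- ( a * u + 1 )^(1/a)                   # undo the power part
xinv <- w / sum(w)                         # close back onto the simplex
```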
A matrix with the back-transformed, compositional data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Tsagris M. and Stewart C. (2022). A Review of Flexible Transformations for Modeling Compositional Data. In Advances and Innovations in Statistics and Data Science, pp. 225–234. https://link.springer.com/chapter/10.1007/978-3-031-08329-7_10
Tsagris M.T., Preston S. and Wood A.T.A. (2016). Improved supervised classification for compositional data using the α-transformation. Journal of Classification 33(2): 243–261. https://arxiv.org/pdf/1506.04976v2.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
library(MASS)
x <- as.matrix(fgl[1:10, 2:9])
x <- x / rowSums(x)
y <- alfa(x, 0.5)$aff
alfainv(y, 0.5)
James test for testing the equality of two population mean vectors without assuming equality of the covariance matrices.
james(y1, y2, a = 0.05, R = 999, graph = FALSE)
y1 |
A matrix containing the Euclidean data of the first group. |
y2 |
A matrix containing the Euclidean data of the second group. |
a |
The significance level, set to 0.05 by default. |
R |
If R is 1 no bootstrap calibration is performed and the classical p-value via the F distribution is returned. If R is greater than 1, the bootstrap p-value is returned. |
graph |
A boolean variable which is taken into consideration only when bootstrap calibration is performed. If TRUE the histogram of the bootstrap test statistic values is plotted. |
Here we show the modified version of the two-sample test (function
hotel2T2
) in the case where the two covariance matrices cannot be assumed to be equal.
James (1954) proposed a test for linear hypotheses of the population means when the variances (or the covariance matrices) are not known. Its form for two p-dimensional samples is:
T_u² = (x̄₁ − x̄₂)ᵀ S̃⁻¹ (x̄₁ − x̄₂),
where S̃ = S̃₁ + S̃₂ = S₁/n₁ + S₂/n₂.
James (1954) suggested that the test statistic is compared with 2h(α), a corrected χ² distribution whose form is
2h(α) = χ² (A + B χ²),
where
A = 1 + (1/(2p)) Σ_{i=1,2} [tr(S̃⁻¹S̃ᵢ)]² / (nᵢ − 1)
and
B = [1/(p(p + 2))] { Σ_{i=1,2} tr[(S̃⁻¹S̃ᵢ)²] / (nᵢ − 1) + (1/2) Σ_{i=1,2} [tr(S̃⁻¹S̃ᵢ)]² / (nᵢ − 1) }.
If you want to do bootstrap to get the p-value, then you must transform the data under the null hypothesis. The estimate of the common mean is given by Aitchison (1986):
μ̂_c = (n₁S₁⁻¹ + n₂S₂⁻¹)⁻¹ (n₁S₁⁻¹x̄₁ + n₂S₂⁻¹x̄₂).
The modified Nel and van der Merwe (1986) test is based on the same quadratic form as that of James (1954), but the distribution used to compare the value of the test statistic is different.
It is shown in Krishnamoorthy and Yanping (2006) that, approximately, T_u² ~ [νp/(ν − p + 1)] F_{p, ν−p+1}, where
ν = (p + p²) / { Σ_{i=1,2} (1/nᵢ) [ tr((S̃ᵢS̃⁻¹)²) + (tr(S̃ᵢS̃⁻¹))² ] }.
The algorithm is taken from Krishnamoorthy and Yu (2004).
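The quantities above can be computed directly in base R; the following is a minimal illustration of the statistic and of James's correction terms with the iris data, not the actual james() code:

```r
# Minimal base-R sketch of the James test statistic T_u^2 and the
# correction terms A and B defined above (illustration only).
y1 <- as.matrix( iris[1:25, 1:4] )
y2 <- as.matrix( iris[26:50, 1:4] )
n1 <- nrow(y1);  n2 <- nrow(y2);  p <- ncol(y1)
d <- colMeans(y1) - colMeans(y2)
S1 <- cov(y1) / n1                        # \tilde S_1
S2 <- cov(y2) / n2                        # \tilde S_2
S <- S1 + S2                              # \tilde S
Tu <- as.numeric( d %*% solve(S, d) )     # the quadratic form T_u^2
t1 <- sum( diag( solve(S, S1) ) )         # tr( S^{-1} S_1 )
t2 <- sum( diag( solve(S, S2) ) )
q1 <- sum( diag( solve(S, S1) %*% solve(S, S1) ) )   # tr( (S^{-1} S_1)^2 )
q2 <- sum( diag( solve(S, S2) %*% solve(S, S2) ) )
A <- 1 + ( t1^2 / (n1 - 1) + t2^2 / (n2 - 1) ) / (2 * p)
B <- ( q1 / (n1 - 1) + q2 / (n2 - 1) +
       0.5 * ( t1^2 / (n1 - 1) + t2^2 / (n2 - 1) ) ) / ( p * (p + 2) )
crit <- qchisq(0.95, p) * ( A + B * qchisq(0.95, p) )  # corrected critical value
```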
A list including:
note |
A message informing the user about the test used. |
mesoi |
The two mean vectors. |
info |
The test statistic, the p-value, the correction factor and the corrected critical value of the chi-square distribution if the James test has been used or, the test statistic, the p-value, the critical value and the degrees of freedom (numerator and denominator) of the F distribution if the modified James test has been used. |
pvalue |
The bootstrap p-value if bootstrap is employed. |
runtime |
The runtime of the bootstrap calibration. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
James G.S. (1954). Tests of Linear Hypotheses in Univariate and Multivariate Analysis when the Ratios of the Population Variances are Unknown. Biometrika, 41(1/2): 19–43.
Krishnamoorthy K. and Yu J. (2004). Modified Nel and Van der Merwe test for the multivariate Behrens-Fisher problem. Statistics & Probability Letters, 66(2): 161–169.
Krishnamoorthy K. and Yanping Xia (2006). On Selecting Tests for Equality of Two Normal Mean Vectors. Multivariate Behavioral Research, 41(4): 533–548.
Tsagris M., Preston S. and Wood A.T.A. (2017). Nonparametric hypothesis testing for equality of means on the simplex. Journal of Statistical Computation and Simulation, 87(2): 406–422.
hotel2T2, maovjames, el.test2, eel.test2
james( as.matrix(iris[1:25, 1:4]), as.matrix(iris[26:50, 1:4]), R = 1 )
james( as.matrix(iris[1:25, 1:4]), as.matrix(iris[26:50, 1:4]), R = 2 )
james( as.matrix(iris[1:25, 1:4]), as.matrix(iris[26:50, 1:4]) )
Kernel regression (Nadaraya-Watson estimator) with a numerical response vector or matrix.
kern.reg(xnew, y, x, h = seq(0.1, 1, length = 10), type = "gauss" )
xnew |
A matrix with the new predictor variables whose response values are to be predicted. |
y |
A numerical vector or a matrix with the response value. |
x |
A matrix with the available predictor variables. |
h |
The bandwidth value(s) to consider. |
type |
The type of kernel to use, "gauss" or "laplace". |
The Nadaraya-Watson kernel regression estimator is applied.
The fitted values. If a single bandwidth is considered, this is a vector or a matrix, depending on the nature of the response. If multiple bandwidth values are considered, this is a matrix if the response is a vector, or a list if the response is a matrix.
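For intuition, the Nadaraya-Watson estimator with a Gaussian kernel can be written in a few lines of base R; nw() below is a hypothetical helper, not the package function:

```r
# Nadaraya-Watson estimator: each fitted value is a kernel-weighted
# average of the observed responses.
nw <- function(xnew, y, x, h) {
  xnew <- as.matrix(xnew)
  x <- as.matrix(x)
  est <- numeric( nrow(xnew) )
  for (i in 1:nrow(xnew)) {
    d2 <- rowSums( ( x - matrix(xnew[i, ], nrow(x), ncol(x), byrow = TRUE) )^2 )
    w <- exp( - d2 / (2 * h^2) )      # Gaussian kernel weights
    est[i] <- sum(w * y) / sum(w)     # weighted average of the responses
  }
  est
}
y <- iris[, 1]
x <- as.matrix( iris[, 2:4] )
est <- nw(x, y, x, h = 0.2)
```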
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Wand M. P. and Jones M. C. (1994). Kernel smoothing. CRC press.
kernreg.tune, ice.kernreg, akern.reg, aknn.reg
y <- iris[, 1]
x <- iris[, 2:4]
est <- kern.reg(x, y, x, h = c(0.1, 0.2) )
Kullback-Leibler divergence and Bhattacharyya distance between two Dirichlet distributions.
kl.diri(a, b, type = "KL")
a |
A vector with the parameters of the first Dirichlet distribution. |
b |
A vector with the parameters of the second Dirichlet distribution. |
type |
A variable indicating whether the Kullback-Leibler divergence ("KL") or the Bhattacharyya distance ("bhatt") is to be computed. |
Note that the order matters in the Kullback-Leibler divergence, since it is asymmetric, but not in the Bhattacharyya distance, since the latter is a metric.
The value of the Kullback-Leibler divergence or the Bhattacharyya distance.
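For reference, the closed form of the Kullback-Leibler divergence between Dir(a) and Dir(b) can be written out in base R; kl_dir() is a hypothetical helper meant to mirror kl.diri() with type = "KL":

```r
# KL( Dir(a) || Dir(b) ) from the standard closed form.
kl_dir <- function(a, b) {
  a0 <- sum(a)
  b0 <- sum(b)
  lgamma(a0) - sum( lgamma(a) ) - lgamma(b0) + sum( lgamma(b) ) +
    sum( (a - b) * ( digamma(a) - digamma(a0) ) )
}
a <- c(2, 3, 4)
b <- c(1, 1, 1)
kl_dir(a, b)    # not equal to kl_dir(b, a) in general
```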
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
a <- runif(10, 0, 20)
b <- runif(10, 1, 10)
kl.diri(a, b)
kl.diri(b, a)
kl.diri(a, b, type = "bhatt")
kl.diri(b, a, type = "bhatt")
LASSO Kullback-Leibler divergence based regression.
lasso.klcompreg(y, x, alpha = 1, lambda = NULL, nlambda = 100, type = "grouped", xnew = NULL)
y |
A numerical matrix with compositional data. Zero values are allowed. |
x |
A numerical matrix containing the predictor variables. |
alpha |
The elastic net mixing parameter, with 0 ≤ alpha ≤ 1. The value alpha = 1 (the default) corresponds to the lasso penalty and alpha = 0 to the ridge penalty. |
lambda |
This information is copied from the package glmnet. A user supplied lambda sequence. Typical usage is to have the program compute its own lambda sequence based on nlambda and lambda.min.ratio. Supplying a value of lambda overrides this. WARNING: use with care. Avoid supplying a single value for lambda (for predictions after CV use predict() instead). Supply instead a decreasing sequence of lambda values. glmnet relies on its warm starts for speed, and it is often faster to fit a whole path than to compute a single fit. |
nlambda |
This information is copied from the package glmnet. The number of lambda values; the default is 100. |
type |
This information is copied from the package glmnet. If "grouped", then a grouped lasso penalty is used on the multinomial coefficients for a variable. This ensures they are all in or out together. The default in our case is "grouped". |
xnew |
If you have new data use it, otherwise leave it NULL. |
The function uses the glmnet package to perform LASSO penalised regression. For more details see the function in that package.
A list including:
mod |
We decided to keep the same list that is returned by glmnet. So, see the function in that package for more information. |
est |
If you supply a matrix in the "xnew" argument, this will return an array of matrices with the fitted values, where each matrix corresponds to one value of lambda. |
Michail Tsagris and Abdulaziz Alenazi.
R implementation and documentation: Michail Tsagris [email protected] and Abdulaziz Alenazi [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Alenazi A. A. (2022). f-divergence regression models for compositional data. Pakistan Journal of Statistics and Operation Research, 18(4): 867–882.
Friedman J., Hastie T. and Tibshirani R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Vol. 33(1), 1–22.
lassocoef.plot, cv.lasso.klcompreg, kl.compreg, lasso.compreg, ols.compreg, alfa.pcr, alfa.knn.reg
y <- as.matrix(iris[, 1:4])
y <- y / rowSums(y)
x <- matrix( rnorm(150 * 30), ncol = 30 )
a <- lasso.klcompreg(y, x)
LASSO log-ratio regression with compositional response.
lasso.compreg(y, x, alpha = 1, lambda = NULL, nlambda = 100, xnew = NULL)
y |
A numerical matrix with compositional data. Zero values are not allowed, as the additive log-ratio transformation (alr) is applied. |
x |
A numerical matrix containing the predictor variables. |
alpha |
The elastic net mixing parameter, with 0 ≤ alpha ≤ 1. The value alpha = 1 (the default) corresponds to the lasso penalty and alpha = 0 to the ridge penalty. |
lambda |
This information is copied from the package glmnet. A user supplied lambda sequence. Typical usage is to have the program compute its own lambda sequence based on nlambda and lambda.min.ratio. Supplying a value of lambda overrides this. WARNING: use with care. Avoid supplying a single value for lambda (for predictions after CV use predict() instead). Supply instead a decreasing sequence of lambda values. glmnet relies on its warm starts for speed, and it is often faster to fit a whole path than to compute a single fit. |
nlambda |
This information is copied from the package glmnet. The number of lambda values; the default is 100. |
xnew |
If you have new data use it, otherwise leave it NULL. |
The function uses the glmnet package to perform LASSO penalised regression. For more details see the function in that package.
A list including:
mod |
We decided to keep the same list that is returned by glmnet. So, see the function in that package for more information. |
est |
If you supply a matrix in the "xnew" argument, this will return an array of matrices with the fitted values, where each matrix corresponds to one value of lambda. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Friedman, J., Hastie, T. and Tibshirani, R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Vol. 33(1), 1-22.
cv.lasso.compreg, lassocoef.plot, lasso.klcompreg, cv.lasso.klcompreg,
comp.reg
y <- as.matrix(iris[, 1:4])
y <- y / rowSums(y)
x <- matrix( rnorm(150 * 30), ncol = 30 )
a <- lasso.compreg(y, x)
α-transformation
LASSO with compositional predictors using the α-transformation.
alfa.lasso(y, x, a = seq(-1, 1, by = 0.1), model = "gaussian", lambda = NULL, xnew = NULL)
y |
A numerical vector or a matrix for multinomial logistic regression. |
x |
A numerical matrix containing the predictor variables, compositional data, where zero values are allowed. |
a |
A vector with a grid of values of the power transformation; the values have to be between -1 and 1.
If zero values are present they have to be greater than 0. If α = 0, the isometric log-ratio transformation is applied. |
model |
The type of the regression model, "gaussian", "binomial", "poisson", "multinomial", or "mgaussian". |
lambda |
This information is copied from the package glmnet. A user supplied lambda sequence. Typical usage is to have the program compute its own lambda sequence based on nlambda and lambda.min.ratio. Supplying a value of lambda overrides this. WARNING: use with care. Avoid supplying a single value for lambda (for predictions after CV use predict() instead). Supply instead a decreasing sequence of lambda values. glmnet relies on its warm starts for speed, and it is often faster to fit a whole path than to compute a single fit. |
xnew |
If you have new data use it, otherwise leave it NULL. |
The function uses the glmnet package to perform LASSO penalised regression. For more details see the function in that package.
A list including sublists for each value of α:
mod |
We decided to keep the same list that is returned by glmnet. So, see the function in that package for more information. |
est |
If you supply a matrix in the "xnew" argument, this will return an array of matrices with the fitted values, where each matrix corresponds to one value of lambda. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Friedman, J., Hastie, T. and Tibshirani, R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Vol. 33(1), 1–22.
alfalasso.tune, cv.lasso.klcompreg, lasso.compreg, alfa.knn.reg
y <- as.matrix(iris[, 1])
x <- rdiri(150, runif(20, 2, 5) )
mod <- alfa.lasso(y, x, a = c(0, 0.5, 1))
Log-contrast GLMs with compositional predictor variables.
lc.glm(y, x, z = NULL, model = "logistic", xnew = NULL, znew = NULL)
y |
A numerical vector containing the response variable values. This is either a binary variable or a vector with counts. |
x |
A matrix with the predictor variables, the compositional data. No zero values are allowed. |
z |
A matrix, data.frame, factor or a vector with some other covariate(s). |
model |
This can be either "logistic" or "poisson". |
xnew |
A matrix containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
znew |
A matrix, data.frame, factor or a vector with the values of some other covariate(s). If you have no new data, leave this NULL as is by default. |
The function performs the log-contrast logistic or Poisson regression model. The logarithm of the
compositional predictor variables is used (hence no zero values are allowed). The response variable
is linked to the log-transformed data with the constraint that the sum of the regression coefficients
equals 0. If you want the regression without the sum-to-zero constraint see ulc.glm
.
Extra predictor variables are allowed as well, for instance categorical or continuous.
A list including:
devi |
The residual deviance of the logistic or Poisson regression model. |
be |
The constrained regression coefficients. Their sum (excluding the constant) equals 0. |
est |
If the arguments "xnew" and znew were given these are the predicted or estimated values, otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Lu J., Shi P. and Li H. (2019). Generalized linear models with linear constraints for microbiome compositional data. Biometrics, 75(1): 235–244.
ulc.glm, lc.glm2, ulc.glm2, lcglm.aov
y <- rbinom(150, 1, 0.5)
x <- rdiri(150, runif(3, 1, 4) )
mod1 <- lc.glm(y, x)
Log-contrast logistic or Poisson regression with multiple compositional predictors.
lc.glm2(y, x, z = NULL, model = "logistic", xnew = NULL, znew = NULL)
y |
A numerical vector containing the response variable values. This is either a binary variable or a vector with counts. |
x |
A matrix with the predictor variables, the compositional data. No zero values are allowed. |
z |
A matrix, data.frame, factor or a vector with some other covariate(s). |
model |
This can be either "logistic" or "poisson". |
xnew |
A matrix containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
znew |
A matrix, data.frame, factor or a vector with the values of some other covariate(s). If you have no new data, leave this NULL as is by default. |
The function performs the log-contrast logistic or Poisson regression model. The logarithm of the
compositional predictor variables is used (hence no zero values are allowed). The response variable
is linked to the log-transformed data with the constraint that the sum of the regression coefficients
equals 0. If you want the regression without the sum-to-zero constraint see ulc.glm2
.
Extra predictor variables are allowed as well, for instance categorical or continuous.
A list including:
devi |
The residual deviance of the logistic or Poisson regression model. |
be |
The constrained regression coefficients. Their sum (excluding the constant) equals 0. |
est |
If the arguments "xnew" and znew were given these are the predicted or estimated values, otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Lu J., Shi P. and Li H. (2019). Generalized linear models with linear constraints for microbiome compositional data. Biometrics, 75(1): 235–244.
y <- rbinom(150, 1, 0.5)
x <- list()
x1 <- as.matrix(iris[, 2:4])
x1 <- x1 / rowSums(x1)
x[[ 1 ]] <- x1
x[[ 2 ]] <- rdiri(150, runif(4) )
x[[ 3 ]] <- rdiri(150, runif(5) )
mod <- lc.glm2(y, x)
Log-contrast quantile regression with compositional predictor variables.
lc.rq(y, x, z = NULL, tau, xnew = NULL, znew = NULL)
y |
A numerical vector containing the response variable values. |
x |
A matrix with the predictor variables, the compositional data. No zero values are allowed. |
z |
A matrix, data.frame, factor or a vector with some other covariate(s). |
tau |
The quantile to be estimated, a number between 0 and 1. |
xnew |
A matrix containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
znew |
A matrix, data.frame, factor or a vector with the values of some other covariate(s). If you have no new data, leave this NULL as is by default. |
The function performs the quantile regression model. The logarithm of the compositional
predictor variables is used (hence no zero values are allowed). The response variable is
linked to the log-transformed data with the constraint that the sum of the regression
coefficients equals 0. If you want the regression without the sum-to-zero constraint see ulc.rq
.
Extra predictor variables are allowed as well, for instance categorical
or continuous.
A list including:
mod |
The object as returned by the function quantreg::rq(). This is useful for hypothesis testing purposes. |
be |
The constrained regression coefficients. Their sum (excluding the constant) equals 0. |
est |
If the arguments "xnew" and znew were given these are the predicted or estimated values, otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Koenker R. W. and Bassett G. W. (1978). Regression Quantiles, Econometrica, 46(1): 33–50.
Koenker R. W. and d'Orey V. (1987). Algorithm AS 229: Computing Regression Quantiles. Applied Statistics, 36(3): 383–393.
y <- rnorm(150)
x <- rdiri(150, runif(3, 1, 4) )
mod1 <- lc.rq(y, x, tau = 0.5)
Log-contrast quantile regression with multiple compositional predictors.
lc.rq2(y, x, z = NULL, tau = 0.5, xnew = NULL, znew = NULL)
y |
A numerical vector containing the response variable values. |
x |
A matrix with the predictor variables, the compositional data. No zero values are allowed. |
z |
A matrix, data.frame, factor or a vector with some other covariate(s). |
tau |
The quantile to be estimated, a number between 0 and 1. |
xnew |
A matrix containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
znew |
A matrix, data.frame, factor or a vector with the values of some other covariate(s). If you have no new data, leave this NULL as is by default. |
The function performs the log-contrast quantile regression model. The logarithm
of the compositional predictor variables is used (hence no zero values are allowed).
The response variable is linked to the log-transformed data with the constraint
that the sum of the regression coefficients equals 0. If you want the regression
without the sum-to-zero constraint see ulc.rq2
. Extra predictor
variables are allowed as well, for instance categorical or continuous.
A list including:
mod |
The object as returned by the function quantreg::rq(). This is useful for hypothesis testing purposes. |
be |
The constrained regression coefficients. Their sum (excluding the constant) equals 0. |
est |
If the arguments "xnew" and znew were given these are the predicted or estimated values, otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Koenker R. W. and Bassett G. W. (1978). Regression Quantiles, Econometrica, 46(1): 33–50.
Koenker R. W. and d'Orey V. (1987). Algorithm AS 229: Computing Regression Quantiles. Applied Statistics, 36(3): 383–393.
y <- rnorm(150)
x <- list()
x1 <- as.matrix(iris[, 2:4])
x1 <- x1 / rowSums(x1)
x[[ 1 ]] <- x1
x[[ 2 ]] <- rdiri(150, runif(4) )
x[[ 3 ]] <- rdiri(150, runif(5) )
mod <- lc.rq2(y, x)
Log-contrast regression with compositional predictor variables.
lc.reg(y, x, z = NULL, xnew = NULL, znew = NULL)
y |
A numerical vector containing the response variable values. This must be a continuous variable. |
x |
A matrix with the predictor variables, the compositional data. No zero values are allowed. |
z |
A matrix, data.frame, factor or a vector with some other covariate(s). |
xnew |
A matrix containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
znew |
A matrix, data.frame, factor or a vector with the values of some other covariate(s). If you have no new data, leave this NULL as is by default. |
The function performs the log-contrast regression model as described in Aitchison (2003), pg. 84-85.
The logarithm of the compositional predictor variables is used (hence no zero values are allowed).
The response variable is linked to the log-transformed data with the constraint that the sum of the
regression coefficients equals 0. Hence, we apply constrained least squares, which has a closed form
solution. The constrained least squares estimator is described in Chapter 8.2 of Hansen (2022). The idea is to
minimise the sum of squares of the residuals under the linear constraint Rβ = c, where
in our case R = (0, 1, …, 1) and c = 0, so that the coefficients of the log-transformed components sum to zero. If you want the regression without the sum-to-zero constraint see
ulc.reg
.
Extra predictor variables are allowed as well, for instance categorical or continuous.
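The closed-form constrained least squares step can be sketched in base R; this illustrates the textbook formula β̂_c = β̂ − (XᵀX)⁻¹Rᵀ[R(XᵀX)⁻¹Rᵀ]⁻¹(Rβ̂ − c) with c = 0, not the actual lc.reg() code:

```r
# Constrained least squares: force the coefficients of the logged
# composition (excluding the constant) to sum to zero.
y <- iris[, 1]
x <- as.matrix( iris[, 2:4] )
x <- x / rowSums(x)
X <- cbind(1, log(x))                           # constant plus log-composition
XtXinv <- solve( crossprod(X) )
be <- XtXinv %*% crossprod(X, y)                # unconstrained OLS
R <- matrix( c(0, rep(1, ncol(x))), nrow = 1 )  # R beta = 0 excludes the constant
bec <- be - XtXinv %*% t(R) %*% solve( R %*% XtXinv %*% t(R), R %*% be )
sum( bec[-1] )                                  # zero, up to rounding error
```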
A list including:
be |
The constrained regression coefficients. Their sum (excluding the constant) equals 0. |
covbe |
The covariance matrix of the constrained regression coefficients. |
va |
The estimated regression variance. |
residuals |
The vector of residuals. |
est |
If the arguments "xnew" and znew were given these are the predicted or estimated values, otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Hansen, B. E. (2022). Econometrics. Princeton University Press.
ulc.reg, lcreg.aov, lc.reg2, alfa.pcr, alfa.knn.reg
y <- iris[, 1]
x <- as.matrix(iris[, 2:4])
x <- x / rowSums(x)
mod1 <- lc.reg(y, x)
mod2 <- lc.reg(y, x, z = iris[, 5])
Log-contrast regression with multiple compositional predictors.
lc.reg2(y, x, z = NULL, xnew = NULL, znew = NULL)
y |
A numerical vector containing the response variable values. This must be a continuous variable. |
x |
A list with multiple matrices with the predictor variables, the compositional data. No zero values are allowed. |
z |
A matrix, data.frame, factor or a vector with some other covariate(s). |
xnew |
A list with multiple matrices containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
znew |
A matrix, data.frame, factor or a vector with the values of some other covariate(s). If you have no new data, leave this NULL as is by default. |
The function performs the log-contrast regression model as described in Aitchison (2003), pg. 84-85.
The logarithm of the compositional predictor variables is used (hence no zero values are allowed).
The response variable is linked to the log-transformed data with the constraint that the sum of the
regression coefficients for each composition equals 0. Hence, we apply constrained least squares,
which has a closed form solution. The constrained least squares is described in Chapter 8.2 of Hansen (2022).
The idea is to minimise the sum of squares of the residuals under the constraint \(R^\top \beta = c\),
where in our case \(R\) selects the coefficients of each composition and \(c = 0\).
If you want the regression without the sum-to-zero constraints see
ulc.reg2
. Extra predictor variables are allowed as well, for instance categorical
or continuous. The difference from lc.reg
is that instead of one, there are multiple
compositions treated as predictor variables.
A list including:
be |
The constrained regression coefficients. Each set of coefficients corresponding to a predictor composition sums to 0 (the constant is excluded). |
covbe |
The covariance matrix of the constrained regression coefficients. |
va |
The estimated regression variance. |
residuals |
The vector of residuals. |
est |
If the arguments "xnew" and "znew" were given these are the predicted or estimated values, otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Hansen, B. E. (2022). Econometrics. Princeton University Press.
Xiaokang Liu, Xiaomei Cong, Gen Li, Kendra Maas and Kun Chen (2020). Multivariate Log-Contrast Regression with Sub-Compositional Predictors: Testing the Association Between Preterm Infants' Gut Microbiome and Neurobehavioral Outcome.
ulc.reg2, lc.reg, ulc.reg, lcreg.aov, alfa.pcr, alfa.knn.reg
y <- iris[, 1]
x <- list()
x1 <- as.matrix(iris[, 2:4])
x1 <- x1 / rowSums(x1)
x[[ 1 ]] <- x1
x[[ 2 ]] <- rdiri(150, runif(4))
x[[ 3 ]] <- rdiri(150, runif(5))
mod <- lc.reg2(y, x)
be <- mod$be
sum(be[2:4])
sum(be[5:8])
sum(be[9:13])
Log-likelihood ratio test for a Dirichlet mean vector.
dirimean.test(x, a)
x |
A matrix with the compositional data. No zero values are allowed. |
a |
A hypothesised compositional mean vector; the concentration parameter is estimated first. If the elements do not sum to 1, they are taken to be the Dirichlet parameters themselves. |
A log-likelihood ratio test is performed for the hypothesis that the given vector of parameters "a" describes the compositional data well.
If there are no zeros in the data, a list including:
param |
A matrix with the estimated parameters under the null and the alternative hypothesis. |
loglik |
The log-likelihood under the alternative and the null hypothesis. |
info |
The value of the test statistic and its relevant p-value. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
sym.test, diri.nr, diri.est, rdiri, ddiri
x <- rdiri(100, c(1, 2, 3))
dirimean.test(x, c(1, 2, 3))
dirimean.test(x, c(1, 2, 3)/6)
Log-likelihood ratio test for a symmetric Dirichlet distribution.
sym.test(x)
x |
A matrix with the compositional data. No zero values are allowed. |
A log-likelihood ratio test is performed for the hypothesis that all Dirichlet parameters are equal.
A list including:
est.par |
The estimated parameters under the alternative hypothesis. |
one.par |
The value of the estimated parameter under the null hypothesis. |
res |
The log-likelihood under the alternative and the null hypothesis, the value of the test statistic, its relevant p-value and the associated degrees of freedom, which equal the dimensionality of the simplex, \(D - 1\). |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
diri.nr, diri.est, rdiri, dirimean.test
x <- rdiri(100, c(5, 7, 1, 3, 10, 2, 4))
sym.test(x)
x <- rdiri(100, c(5, 5, 5, 5, 5))
sym.test(x)
Minimized Kullback-Leibler divergence between Dirichlet and logistic normal distributions.
kl.diri.normal(a)
a |
A vector with the parameters of the Dirichlet distribution. |
The function computes the minimized Kullback-Leibler divergence from the Dirichlet distribution to the logistic normal distribution.
The minimized Kullback-Leibler divergence from the Dirichlet distribution to the logistic normal distribution.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data, p. 127. Chapman & Hall.
diri.nr, diri.contour, rdiri, ddiri, dda, diri.reg
a <- runif(5, 1, 5)
kl.diri.normal(a)
Mixture model selection via BIC.
bic.mixcompnorm(x, G, type = "alr", veo = FALSE, graph = TRUE)
x |
A matrix with compositional data. |
G |
A numeric vector with the number of components, clusters, to be considered, e.g. 1:3. |
type |
The type of transformation to be used, either the additive log-ratio ("alr"), the isometric log-ratio ("ilr") or the pivot coordinate ("pivot") transformation. |
veo |
Stands for "Variables exceed observations". If TRUE, the model is still fitted even when the number of variables in the model exceeds the number of observations. |
graph |
A boolean variable, TRUE or FALSE specifying whether a graph should be drawn or not. |
The alr or the ilr-transformation is applied to the compositional data first and then mixtures of multivariate Gaussian distributions are fitted. BIC is used to decide on the optimal model and number of components.
A plot with the BIC of the best model for each number of components versus the number of components. A list including:
mod |
A message informing the user about the best model. |
BIC |
The BIC values for every possible model and number of components. |
optG |
The number of components with the highest BIC. |
optmodel |
The type of model corresponding to the highest BIC. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ryan P. Browne, Aisha ElSherbiny and Paul D. McNicholas (2018). mixture: Mixture Models for Clustering and Classification. R package version 1.5.
Ryan P. Browne and Paul D. McNicholas (2014). Estimating Common Principal Components in High Dimensions. Advances in Data Analysis and Classification, 8(2), 217-226.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
mix.compnorm, mix.compnorm.contour, rmixcomp, bic.alfamixnorm
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
bic.mixcompnorm(x, 1:3, type = "alr", graph = FALSE)
bic.mixcompnorm(x, 1:3, type = "ilr", graph = FALSE)
Mixture model selection with the α-transformation using BIC.
bic.alfamixnorm(x, G, a = seq(-1, 1, by = 0.1), veo = FALSE, graph = TRUE)
x |
A matrix with compositional data. |
G |
A numeric vector with the number of components, clusters, to be considered, e.g. 1:3. |
a |
A vector with a grid of values of the power transformation parameter α; it has to be between -1 and 1. If zero values are present
it has to be greater than 0. If α = 0, the isometric log-ratio transformation is applied. |
veo |
Stands for "Variables exceed observations". If TRUE, the model is still fitted even when the number of variables in the model exceeds the number of observations. |
graph |
A boolean variable, TRUE or FALSE specifying whether a graph should be drawn or not. |
The α-transformation is applied to the compositional data first and then mixtures of multivariate Gaussian
distributions are fitted. BIC is used to decide on the optimal model and number of components.
A list including:
abic |
A list that contains the matrices of all BIC values for every value of α. |
optalpha |
The value of α that leads to the highest BIC. |
optG |
The number of components with the highest BIC. |
optmodel |
The type of model corresponding to the highest BIC. |
If graph is set equal to TRUE, a plot with the BIC of the best model for each number of components versus the number of components is produced, along with a list with the results of the Gaussian mixture model for each value of α.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ryan P. Browne, Aisha ElSherbiny and Paul D. McNicholas (2018). mixture: Mixture Models for Clustering and Classification. R package version 1.5.
Ryan P. Browne and Paul D. McNicholas (2014). Estimating Common Principal Components in High Dimensions. Advances in Data Analysis and Classification, 8(2), 217-226.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
alfa.mix.norm, mix.compnorm, mix.compnorm.contour, rmixcomp, alfa, alfa.knn,
alfa.rda, comp.nb
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
bic.alfamixnorm(x, 1:3, a = c(0.4, 0.5, 0.6), graph = FALSE)
MLE of the parameters of a multivariate t distribution.
multivt(y, plot = FALSE)
y |
A matrix with continuous data. |
plot |
If plot is TRUE the value of the maximum log-likelihood as a function of the degrees of freedom is presented. |
The parameters of a multivariate t distribution are estimated. This is used by the functions comp.den
and bivt.contour
.
A list including:
center |
The location estimate. |
scatter |
The scatter matrix estimate. |
df |
The estimated degrees of freedom. |
loglik |
The log-likelihood value. |
mesos |
The classical mean vector. |
covariance |
The classical covariance matrix. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Nadarajah, S. and Kotz, S. (2008). Estimation methods for the multivariate t distribution. Acta Applicandae Mathematicae, 102(1):99-118.
x <- as.matrix(iris[, 1:4])
multivt(x)
MLE of distributions defined in the (0, 1) interval.
beta.est(x, tol = 1e-07)
logitnorm.est(x)
hsecant01.est(x, tol = 1e-07)
kumar.est(x, tol = 1e-07)
unitweibull.est(x, tol = 1e-07, maxiters = 100)
ibeta.est(x, tol = 1e-07)
zilogitnorm.est(x)
x |
A numerical vector with proportions, i.e. numbers in (0, 1) (zeros and ones are not allowed). |
tol |
The tolerance level up to which the maximisation stops. |
maxiters |
The maximum number of iterations the Newton-Raphson algorithm will perform. |
Maximum likelihood estimation of the parameters of some distributions defined in (0, 1) is performed, in some cases via the Newton-Raphson algorithm. Some distributions, and hence the functions, do not accept zeros. "logitnorm.est" fits the logistic normal, hence no Newton-Raphson is required, while "hsecant01.est" uses the golden ratio search, as it is faster than the Newton-Raphson (fewer computations). "zilogitnorm.est" stands for the zero inflated logistic normal distribution. "ibeta.est" fits the zero or the one inflated beta distribution.
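For intuition, the beta MLE that beta.est computes via Newton-Raphson can be reproduced with a generic optimiser on the log-likelihood; this is a sketch under that shortcut, and beta_mle is a hypothetical name, not part of the package.

```r
# Beta MLE via a generic optimiser instead of Newton-Raphson.
# Optimising on the log scale keeps both shape parameters positive.
beta_mle <- function(x) {
  nll <- function(pa) -sum(dbeta(x, exp(pa[1]), exp(pa[2]), log = TRUE))
  fit <- optim(c(0, 0), nll)
  list(param = exp(fit$par), loglik = -fit$value)
}
set.seed(2)
x <- rbeta(1000, 1, 4)
fit_b <- beta_mle(x)
fit_b$param      # close to the true shapes (1, 4)
```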
A list including:
iters |
The number of iterations required by the Newton-Raphson. |
loglik |
The value of the log-likelihood. |
param |
The estimated parameters. In the case of "hsecant01.est" this is called "theta" as there is only one parameter. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Kumaraswamy, P. (1980). A generalized probability density function for double-bounded random processes. Journal of Hydrology. 46(1-2): 79-88.
Jones, M.C. (2009). Kumaraswamy's distribution: A beta-type distribution with some tractability advantages. Statistical Methodology. 6(1): 70-81.
You can also check the relevant Wikipedia pages.
x <- rbeta(1000, 1, 4)
beta.est(x)
ibeta.est(x)
x <- runif(1000)
hsecant01.est(x)
logitnorm.est(x)
ibeta.est(x)
x <- rbeta(1000, 2, 5)
x[sample(1:1000, 50)] <- 0
ibeta.est(x)
MLE of the parameters of a Dirichlet distribution.
diri.est(x, type = "mle")
x |
A matrix containing compositional data. |
type |
If you want to estimate the parameters use type="mle". If you want to estimate the mean vector along with the precision parameter, the second parametrisation of the Dirichlet, use type="prec". |
Maximum likelihood estimation of the parameters of a Dirichlet distribution is performed.
A list including:
loglik |
The value of the log-likelihood. |
param |
The estimated parameters. |
phi |
The estimated precision parameter, if type = "prec". |
mu |
The estimated mean vector, if type = "prec". |
runtime |
The run time of the maximisation procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Ng Kai Wang, Guo-Liang Tian and Man-Lai Tang (2011). Dirichlet and related distributions: Theory, methods and applications. John Wiley & Sons.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
diri.nr, diri.contour, rdiri, ddiri, dda, diri.reg
x <- rdiri(100, c(5, 7, 1, 3, 10, 2, 4))
diri.est(x)
diri.est(x, type = "prec")
MLE of the Dirichlet distribution via Newton-Raphson.
diri.nr(x, type = 1, tol = 1e-07)
x |
A matrix containing compositional data. Zeros are not allowed. |
type |
Type can either be 1, so that the Newton-Raphson is used for the maximisation of the log-likelihood, as Minka (2003) suggested, or it can be 2. In the latter case the Newton-Raphson algorithm is implemented involving matrix inversions. In addition, an even faster C++ implementation is available in the package Rfast and is used here. |
tol |
The tolerance level indicating no further increase in the log-likelihood. |
Maximum likelihood estimation of the parameters of a Dirichlet distribution is performed via Newton-Raphson. Initial values suggested by Minka (2003) are used. The estimation is much faster than "diri.est" and the difference becomes really apparent when the sample size and/or the dimensionality increase. In fact this will work with millions of observations. So, in general, I trust this one more than "diri.est".
The only problem I have seen with this method is that if the data are concentrated around a point, say the center of the simplex, it will be hard for this and the previous methods to give estimates of the parameters. In this extremely difficult scenario I would suggest the use of the previous function with the precision parametrization "diri.est(x, type = "prec")". It will be extremely fast and accurate.
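As an illustration of the underlying idea, Minka's (2003) fixed-point iteration, which is an alternative to the Newton-Raphson used by diri.nr, can be sketched in a few lines; rdirichlet and diri_fp are hypothetical helper names, not the package's functions.

```r
# Dirichlet MLE via Minka's fixed-point scheme:
# digamma(a_j_new) = digamma(sum(a)) + mean(log x_j),
# with the digamma function inverted by a short Newton iteration.
rdirichlet <- function(n, a) {           # Dirichlet draws via gamma variates
  g <- sapply(a, function(ai) rgamma(n, ai))
  g / rowSums(g)
}
diri_fp <- function(x, tol = 1e-7) {
  inv_digamma <- function(y) {           # Newton inversion of digamma (Minka's start)
    a <- ifelse(y >= -2.22, exp(y) + 0.5, -1 / (y - digamma(1)))
    for (i in 1:10)  a <- a - (digamma(a) - y) / trigamma(a)
    a
  }
  logxb <- colMeans(log(x))
  a <- rep(1, ncol(x))
  repeat {
    a1 <- inv_digamma(digamma(sum(a)) + logxb)
    if (sum(abs(a1 - a)) < tol) break
    a <- a1
  }
  a1
}
set.seed(1)
x <- rdirichlet(2000, c(2, 5, 8))
a_hat <- diri_fp(x)
a_hat        # close to the true parameters (2, 5, 8)
```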
A list including:
iter |
The number of iterations required. If the argument "type" is set to 2 this is not returned. |
loglik |
The value of the log-likelihood. |
param |
The estimated parameters. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Thomas P. Minka (2003). Estimating a Dirichlet distribution. http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf
diri.est, diri.contour rdiri, ddiri, dda
x <- rdiri(100, c(5, 7, 5, 8, 10, 6, 4))
diri.nr(x)
diri.nr(x, type = 2)
diri.est(x)
MLE of the folded model for a given value of α.
alpha.mle(x, a)
a.mle(a, x)
x |
A matrix with the compositional data. No zero values are allowed. |
a |
A value of the power transformation parameter α. |
This is a function for choosing or estimating the value of α in the
α-folded model
(Tsagris and Stewart, 2020). It is called by
a.est
.
If "alpha.mle" is called, a list including:
iters |
The number of iterations the EM algorithm required. |
loglik |
The maximised log-likelihood of the folded model. |
p |
The estimated probability inside the simplex of the α-transformation. |
mu |
The estimated mean vector of the α-transformation. |
su |
The estimated covariance matrix of the α-transformation. |
If "a.mle" is called, the log-likelihood is returned only.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. and Stewart C. (2022). A Review of Flexible Transformations for Modeling Compositional Data. In Advances and Innovations in Statistics and Data Science, pp. 225–234. https://link.springer.com/chapter/10.1007/978-3-031-08329-7_10
Tsagris M. and Stewart C. (2020). A folded model for compositional data analysis. Australian and New Zealand Journal of Statistics, 62(2): 249-277. https://arxiv.org/pdf/1802.07330.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
alfa.profile, alfa, alfainv, a.est
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
mod <- alfa.tune(x)
mod
alpha.mle(x, mod[1])
MLE of the zero adjusted Dirichlet distribution.
zad.est(y)
y |
A matrix with the compositional data. |
A zero adjusted Dirichlet distribution is being fitted and its parameters are estimated.
A list including:
loglik |
The value of the log-likelihood. |
phi |
The precision parameter. If covariates are linked with it (function "diri.reg2"), this will be a vector. |
mu |
The mean vector of the distribution. |
runtime |
The time required by the model. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. and Stewart C. (2018). A Dirichlet regression model for compositional data with zeros. Lobachevskii Journal of Mathematics, 39(3): 398–412.
Preprint available from https://arxiv.org/pdf/1410.5011.pdf
zadr, diri.nr, zilogitnorm.est, zeroreplace
y <- as.matrix(iris[, 1:3])
y <- y / rowSums(y)
mod1 <- diri.nr(y)
y[sample(1:450, 15)] <- 0
mod2 <- zad.est(y)
Multivariate analysis of variance without assuming equality of the covariance matrices.
maovjames(x, ina, a = 0.05)
x |
A matrix containing Euclidean data. |
ina |
A numerical or factor variable indicating the groups of the data. |
a |
The significance level, set to 0.05 by default. |
James (1954) also proposed an alternative to MANOVA when the covariance matrices are not assumed equal. The test statistic for \(k\) samples is
\[ J = \sum_{i=1}^k \left(\bar{x}_i - \hat{\mu}\right)^\top W_i \left(\bar{x}_i - \hat{\mu}\right), \]
where \(\bar{x}_i\) and \(n_i\)
are the sample mean vector and sample size of the
\(i\)-th sample respectively and
\(W_i = S_i^{-1}\), where
\(S_i\) is the covariance matrix of the
\(i\)-th sample mean vector and
\(\hat{\mu}\) is the estimate of the common mean,
\(\hat{\mu} = \left(\sum_{i=1}^k W_i\right)^{-1} \sum_{i=1}^k W_i \bar{x}_i\).
Normally one would compare the test statistic with a \(\chi^2_r\), where
\(r = (k-1)p\) are the degrees of freedom, with
\(k\) denoting the number of groups and
\(p\) the dimensionality of the data. There are
\((k-1)p\) constraints (how many univariate means must be equal, so that the null hypothesis, that all the mean vectors are equal, holds true); that is where these degrees of freedom come from. James (1954) compared the test statistic with a corrected
\(\chi^2\) distribution instead. Let
\(A\) and \(B\)
be
\[ A = 1 + \frac{1}{2r} \sum_{i=1}^k \frac{\left[\mathrm{tr}\left(I_p - W^{-1} W_i\right)\right]^2}{n_i - 1} \]
and
\[ B = \frac{1}{r(r+2)} \sum_{i=1}^k \left\{ \frac{\mathrm{tr}\left[\left(I_p - W^{-1} W_i\right)^2\right]}{n_i - 1} + \frac{\left[\mathrm{tr}\left(I_p - W^{-1} W_i\right)\right]^2}{2\left(n_i - 1\right)} \right\}, \]
where \(W = \sum_{i=1}^k W_i\).
The corrected quantile of the \(\chi^2\) distribution is given as before by
\(\chi^2_r \left(A + B \chi^2_r\right)\).
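A minimal sketch of the statistic and its correction, written directly from the formulas above; james_test is a hypothetical name and maovjames may differ in implementation details.

```r
# James (1954) MANOVA without equal covariance matrices:
# the statistic J plus the chi-square correction chi2_r * (A + B * chi2_r).
james_test <- function(x, ina) {
  ina <- as.numeric(as.factor(ina))
  k <- max(ina);  p <- ncol(x)
  ni <- tabulate(ina)
  mi <- rowsum(x, ina) / ni                          # sample mean vectors
  W <- lapply(1:k, function(i) ni[i] * solve(cov(x[ina == i, ])))
  P <- Reduce("+", W)                                # sum of the W_i
  mu <- solve(P, Reduce("+", lapply(1:k, function(i) W[[i]] %*% mi[i, ])))
  J <- sum(sapply(1:k, function(i) {
    d <- mi[i, ] - mu
    t(d) %*% W[[i]] %*% d
  }))
  r <- p * (k - 1)
  Pinv <- solve(P)
  t1 <- sapply(1:k, function(i) sum(diag(diag(p) - Pinv %*% W[[i]])))
  t2 <- sapply(1:k, function(i) {
    M <- diag(p) - Pinv %*% W[[i]]
    sum(diag(M %*% M))
  })
  A <- 1 + sum(t1^2 / (ni - 1)) / (2 * r)
  B <- (sum(t2 / (ni - 1)) + sum(t1^2 / (2 * (ni - 1)))) / (r * (r + 2))
  chi <- qchisq(0.95, r)
  c(test = J, corr.critical = chi * (A + B * chi))
}
res <- james_test(as.matrix(iris[, 1:4]), iris[, 5])
```

For the iris data the statistic far exceeds the corrected critical value, agreeing with the clear group separation.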
A vector with the next 4 elements:
test |
The test statistic. |
correction |
The value of the correction factor. |
corr.critical |
The corrected critical value of the chi-square distribution. |
p-value |
The p-value of the corrected test statistic. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
James G.S. (1954). Tests of Linear Hypotheses in Univariate and Multivariate Analysis when the Ratios of the Population Variances are Unknown. Biometrika, 41(1/2): 19–43.
maovjames( as.matrix(iris[,1:4]), iris[,5] )
Multivariate analysis of variance assuming equality of the covariance matrices.
maov(x, ina)
x |
A matrix containing Euclidean data. |
ina |
A numerical or factor variable indicating the groups of the data. |
Multivariate analysis of variance assuming equality of the covariance matrices.
A list including:
note |
A message stating whether the \(F\) or the chi-square approximation was used. |
result |
The test statistic and the p-value. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Johnson R.A. and Wichern D.W. (2007, 6th Edition). Applied Multivariate Statistical Analysis, pg. 302–303.
Todorov V. and Filzmoser P. (2010). Robust Statistic for the One-way MANOVA. Computational Statistics & Data Analysis, 54(1): 37–48.
maov(as.matrix(iris[, 1:4]), iris[, 5])
maovjames(as.matrix(iris[, 1:4]), iris[, 5])
Multivariate kernel density estimation.
mkde(x, h = NULL, thumb = "silverman")
x |
A matrix with Euclidean (continuous) data. |
h |
The bandwidth value. It can be a single value, which is turned into a vector and then into a diagonal matrix, or a vector which is turned into a diagonal matrix. If you set this to NULL, then you need to specify the "thumb" argument below. |
thumb |
Do you want to use a rule of thumb for the bandwidth parameter? If no, set h equal to NULL and put "estim" for maximum likelihood cross-validation, "scott" or "silverman" for Scott's and Silverman's rules of thumb respectively. |
The multivariate kernel density estimate is calculated with a (not necessarily given) bandwidth value.
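A minimal sketch of a product-kernel Gaussian estimator with Silverman's rule of thumb; mkde_sketch is a hypothetical name and mkde's bandwidth parameterisation may differ.

```r
# Multivariate Gaussian product-kernel density estimate, with a
# per-coordinate bandwidth from Silverman's rule of thumb.
mkde_sketch <- function(x) {
  n <- nrow(x);  d <- ncol(x)
  h <- apply(x, 2, sd) * (4 / ((d + 2) * n))^(1 / (d + 4))   # Silverman's rule
  f <- numeric(n)
  for (i in 1:n) {
    z <- sweep(sweep(x, 2, x[i, ], "-"), 2, h, "/")          # (x_j - x_i) / h
    f[i] <- mean(exp(-0.5 * rowSums(z * z))) / ((2 * pi)^(d / 2) * prod(h))
  }
  f
}
f <- mkde_sketch(as.matrix(iris[, 1:4]))   # one density value per observation
```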
A vector with the density estimates calculated for every vector.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Arsalane Chouaib Guidoum (2015). Kernel Estimator and Bandwidth Selection for Density and its Derivatives. The kedd R package.
M.P. Wand and M.C. Jones (1995). Kernel smoothing, pages 91-92.
B.W. Silverman (1986). Density estimation for statistics and data analysis, pages 76-78.
mkde(as.matrix(iris[, 1:4]), thumb = "scott")
mkde(as.matrix(iris[, 1:4]), thumb = "silverman")
Multivariate kernel density estimation for compositional data.
comp.kern(x, type= "alr", h = NULL, thumb = "silverman")
x |
A matrix with Euclidean (continuous) data. |
type |
The type of transformation used, either the additive log-ratio ("alr"), the isometric log-ratio ("ilr") or the pivot coordinate ("pivot") transformation. |
h |
The bandwidth value. It can be a single value, which is turned into a vector and then into a diagonal matrix, or a vector which is turned into a diagonal matrix. If it is NULL, then you need to specify the "thumb" argument below. |
thumb |
Do you want to use a rule of thumb for the bandwidth parameter? If no, leave the "h" NULL and put "estim" for maximum likelihood cross-validation, "scott" or "silverman" for Scott's and Silverman's rules of thumb respectively. |
The multivariate kernel density estimate is calculated with a (not necessarily given) bandwidth value.
A vector with the density estimates calculated for every vector.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Arsalane Chouaib Guidoum (2015). Kernel Estimator and Bandwidth Selection for Density and its Derivatives.
The kedd R package.
M.P. Wand and M.C. Jones (1995). Kernel smoothing, pages 91-92.
B.W. Silverman (1986). Density estimation for statistics and data analysis, pages 76-78.
x <- as.matrix(iris[, 1:3])
x <- x / rowSums(x)
f <- comp.kern(x)
Multivariate linear regression.
multivreg(y, x, plot = TRUE, xnew = NULL)
y |
A matrix with the Euclidean (continuous) data. |
x |
A matrix with the predictor variable(s), they have to be continuous. |
plot |
Should a plot appear or not? |
xnew |
If you have new data use it, otherwise leave it NULL. |
The classical multivariate linear regression model is obtained.
A list including:
suma |
A summary, as produced by lm, for each response variable. |
r.squared |
The value of the \(R^2\) for each univariate regression. |
resid.out |
A vector with numbers indicating which observations are potential residual outliers. |
x.leverage |
A vector with numbers indicating which observations are potential outliers in the predictor variables space. |
out |
A vector with numbers indicating which observations are potential outliers in the residuals and in the predictor variables space. |
est |
The predicted values if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
K.V. Mardia, J.T. Kent and J.M. Bibby (1979). Multivariate Analysis. Academic Press.
diri.reg, js.compreg, kl.compreg, ols.compreg, comp.reg
library(MASS)
x <- as.matrix(iris[, 1:2])
y <- as.matrix(iris[, 3:4])
multivreg(y, x, plot = TRUE)
Multivariate normal random values simulation on the simplex.
rcompnorm(n, m, s, type = "alr")
n |
The sample size, a numerical value. |
m |
The mean vector in \(R^d\). |
s |
The covariance matrix in \(R^d\). |
type |
The alr (type = "alr") or the ilr (type = "ilr") is to be used for closing the Euclidean data onto the simplex. |
The algorithm is straightforward: generate random values from a multivariate normal distribution in \(R^d\) and bring the
values to the simplex \(S^d\)
using the inverse of a log-ratio transformation.
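The two steps can be sketched directly, assuming the alr route; the variable names are illustrative, not those used inside rcompnorm.

```r
# Simulate from N(m, s) in R^2, then map to the 3-part simplex with
# the inverse additive log-ratio transformation.
set.seed(3)
n <- 100
m <- c(0.4, -0.2)
s <- matrix(c(0.3, 0.1, 0.1, 0.2), 2, 2)
z <- matrix(rnorm(n * 2), n, 2) %*% chol(s)   # rows ~ N(0, s)
z <- sweep(z, 2, m, "+")                      # shift to mean m
y <- cbind(exp(z), 1)
y <- y / rowSums(y)                           # each row lies on the simplex
```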
A matrix with the simulated data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
comp.den, rdiri, rcompt, rcompsn
x <- as.matrix(iris[, 1:2])
m <- colMeans(x)
s <- var(x)
y <- rcompnorm(100, m, s)
comp.den(y)
ternary(y)
Multivariate or univariate regression with compositional data in the covariates side using the α-transformation.
alfa.pcr(y, x, a, k, model = "gaussian", xnew = NULL)
y |
A numerical vector containing the response variable values. They can be continuous, binary, discrete (counts). This can also be a vector with discrete values or a factor for the multinomial regression (model = "multinomial"). |
x |
A matrix with the predictor variables, the compositional data. |
a |
The value of the power transformation parameter α; it has to be between -1 and 1. If zero values are present it has to be greater than 0.
If α = 0, the isometric log-ratio transformation is applied. |
k |
How many principal components to use. You may also specify a vector and in this case the results produced will refer to each number of principal components. |
model |
The type of regression model to fit. The possible values are "gaussian", "multinomial", "binomial" and "poisson". |
xnew |
A matrix containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
The α-transformation is applied to the compositional data first, the first k principal component scores are calculated and used as predictor variables in a regression model. The family of distributions can be either "gaussian" for a continuous response and hence the normal distribution, "binomial" corresponding to a binary response and hence logistic regression, or "poisson" for a count response and Poisson regression.
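A minimal sketch of the pipeline, using the unrotated version of the α-transformation (the Helmert sub-matrix rotation that alfa() applies is omitted, which leaves the principal component scores essentially unchanged); alfa_simple is a hypothetical name.

```r
# alpha-transform (unrotated), PCA, then regress y on the first score.
alfa_simple <- function(x, a) {
  D <- ncol(x)
  u <- x^a / rowSums(x^a)     # power transformation followed by closure
  D / a * (u - 1 / D)         # centred alpha-transform; rows sum to 0
}
x <- as.matrix(iris[, 2:4])
x <- x / rowSums(x)
y <- iris[, 1]
z <- alfa_simple(x, 0.7)
sc <- prcomp(z)$x[, 1, drop = FALSE]   # first principal component score
mod <- lm(y ~ sc)                      # ordinary regression on the score
```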
A list including:
be |
If linear regression was fitted, the regression coefficients of the k principal component scores on the response variable y. |
mod |
If another regression model was fitted its outcome as produced in the package Rfast. |
per |
The percentage of variance explained by the first k principal components. |
vec |
The first k principal components, loadings or eigenvectors. These are useful for future prediction in the sense that one needs not fit the whole model again. |
est |
If the argument "xnew" was given, these are the predicted or estimated values; otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2015). Regression analysis with compositional data containing zero values. Chilean Journal of Statistics, 6(2): 47-57. https://arxiv.org/pdf/1508.01913v1.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
library(MASS)
y <- as.vector(fgl[, 1])
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
mod <- alfa.pcr(y = y, x = x, 0.7, 1)
mod
Multivariate regression with compositional data.
comp.reg(y, x, type = "classical", xnew = NULL, yb = NULL)
y |
A matrix with compositional data. Zero values are not allowed. |
x |
The predictor variable(s), they have to be continuous. |
type |
The type of regression to be used, "classical" for standard multivariate regression, or "spatial" for the robust spatial median regression. Alternatively you can type "lmfit" for the fast classical multivariate regression that does not return standard errors whatsoever. |
xnew |
This is by default set to NULL. If you have new data whose compositional data values you want to predict, put them here. |
yb |
If you have already transformed the data using the additive log-ratio transformation, put it here. Otherwise leave it NULL.
This is intended to be used in the function |
The additive log-ratio (alr) transformation is applied and then the chosen multivariate regression is implemented. The alr is easier to interpret than the ilr, which is why the latter is avoided here.
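The two steps can be sketched as follows; `alr` here is a hand-rolled helper for illustration, not the package's own function:

```r
# Additive log-ratio (alr) transformation: log of each part over the last one
alr <- function(y) log(y[, -ncol(y), drop = FALSE] / y[, ncol(y)])

y <- as.matrix(iris[, 1:3])
y <- y / rowSums(y)        # close the data onto the simplex
x <- iris[, 4]
z <- alr(y)                # D parts -> D - 1 unconstrained responses
fit <- lm(z ~ x)           # classical multivariate regression on the alr scale
```

The fitted alr values can then be back-transformed onto the simplex for prediction.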
A list including:
runtime |
The time required by the regression. |
be |
The beta coefficients. |
seb |
The standard error of the beta coefficients. |
est |
The fitted values of xnew if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Mardia K.V., Kent J.T., and Bibby J.M. (1979). Multivariate analysis. Academic press.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
multivreg, spatmed.reg, js.compreg, diri.reg
library(MASS)
y <- as.matrix(iris[, 1:3])
y <- y / rowSums(y)
x <- as.vector(iris[, 4])
mod1 <- comp.reg(y, x)
mod2 <- comp.reg(y, x, type = "spatial")
Multivariate skew normal random values simulation on the simplex.
rcompsn(n, xi, Omega, alpha, dp = NULL, type = "alr")
n |
The sample size, a numerical value. |
xi |
A numeric vector of length |
Omega |
A |
alpha |
A numeric vector which regulates the slant of the density. |
dp |
A list with three elements, corresponding to xi, Omega and alpha described above. The default value is NULL. If dp is assigned, the individual parameters must not be specified. |
type |
The alr (type = "alr") or the ilr (type = "ilr") is to be used for closing the Euclidean data onto the simplex. |
The algorithm is straightforward: generate random values from a multivariate skew-normal distribution in R^d and bring the values onto the simplex using the inverse of a log-ratio transformation.
A matrix with the simulated data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Azzalini, A. and Dalla Valle, A. (1996). The multivariate skew-normal distribution. Biometrika, 83(4): 715-726.
Azzalini, A. and Capitanio, A. (1999). Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society Series B, 61(3):579-602. Full-length version available from http://arXiv.org/abs/0911.2093
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
x <- as.matrix(iris[, 1:2])
par <- sn::msn.mle(y = x)$dp
y <- rcompsn(100, dp = par)
comp.den(y, dist = "skewnorm")
ternary(y)
Multivariate t random values simulation on the simplex.
rcompt(n, m, s, dof, type = "alr")
n |
The sample size, a numerical value. |
m |
The mean vector in R^d. |
s |
The covariance matrix in R^d. |
dof |
The degrees of freedom. |
type |
The alr (type = "alr") or the ilr (type = "ilr") is to be used for closing the Euclidean data onto the simplex. |
The algorithm is straightforward: generate random values from a multivariate t distribution in R^d and bring the values onto the simplex using the inverse of a log-ratio transformation.
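A minimal sketch of this recipe built only from base R, using the inverse alr for the closing step (the package's rcompt also supports the ilr):

```r
set.seed(1)
n <- 100;  d <- 2;  dof <- 10
m <- c(0, 1)                         # mean vector in R^d
s <- diag(2)                         # covariance matrix
z <- matrix(rnorm(n * d), n, d) %*% chol(s)
z <- z / sqrt(rchisq(n, dof) / dof)  # scale mixture: multivariate t values
z <- sweep(z, 2, m, "+")
y <- cbind(exp(z), 1)
y <- y / rowSums(y)                  # inverse alr: (d + 1)-part compositions
```

Every row of y is a positive vector summing to 1, i.e. a point on the simplex.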
A matrix with the simulated data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
x <- as.matrix(iris[, 1:2])
m <- Rfast::colmeans(x)
s <- var(x)
y <- rcompt(100, m, s, 10)
comp.den(y, dist = "t")
ternary(y)
Naive Bayes classifiers for compositional data.
comp.nb(xnew = NULL, x, ina, type = "beta")
xnew |
A matrix with the new compositional predictor data whose class you want to predict. Zeros are not allowed. |
x |
A matrix with the available compositional predictor data. Zeros are not allowed. |
ina |
A vector of data. The response variable, which is categorical (factor is acceptable). |
type |
The type of naive Bayes, "beta", "logitnorm", "cauchy", "laplace", "gamma", "normlog" or "weibull". For the last 4 distributions, the negative of the logarithm of the compositional data is applied first. |
Depending on the classifier, a list including (the ni and est components are common to all classifiers):
shape |
A matrix with the shape parameters. |
scale |
A matrix with the scale parameters. |
expmu |
A matrix with the mean parameters. |
sigma |
A matrix with the (MLE, hence biased) variance parameters. |
location |
A matrix with the location parameters (medians). |
scale |
A matrix with the scale parameters. |
mean |
A matrix with the mean parameters. |
var |
A matrix with the variance parameters. |
a |
A matrix with the "alpha" parameters. |
b |
A matrix with the "beta" parameters. |
ni |
The sample size of each group in the dataset. |
est |
The estimated group of the xnew observations. A numerical value is returned regardless of whether the target variable is numerical or a factor. Hence, it is suggested that you run "as.numeric(ina)" in order to see the predicted class of the new data. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman J., Hastie T. and Tibshirani R. (2017). The elements of statistical learning. New York: Springer.
cv.compnb, alfa.rda, alfa.knn, comp.knn, mix.compnorm, dda
x <- Compositional::rdiri(100, runif(5))
ina <- rbinom(100, 1, 0.5) + 1
a <- comp.nb(x, x, ina, type = "beta")
α-transformation
Naive Bayes classifiers for compositional data using the α-transformation.
alfa.nb(xnew, x, ina, a, type = "gaussian")
xnew |
A matrix with the new compositional predictor data whose class you want to predict. Zeros are allowed. |
x |
A matrix with the available compositional predictor data. Zeros are allowed. |
ina |
A vector of data. The response variable, which is categorical (factor is acceptable). |
a |
This can be a vector of values or a single number. |
type |
The type of naive Bayes, "gaussian", "cauchy" or "laplace". |
The α-transformation is applied to the compositional data and a naive Bayes classifier is employed.
A matrix with the estimated groups. One column for each value of α.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
Friedman J., Hastie T. and Tibshirani R. (2017). The elements of statistical learning. New York: Springer.
comp.nb, alfa.rda, alfa.knn, comp.knn, mix.compnorm
x <- Compositional::rdiri(100, runif(5))
ina <- rbinom(100, 1, 0.5) + 1
mod <- alfa.nb(x, x, a = c(0, 0.1, 0.2), ina)
Non-linear least squares regression for compositional data.
ols.compreg(y, x, con = TRUE, B = 1, ncores = 1, xnew = NULL)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. |
x |
A matrix or a data frame with the predictor variable(s). |
con |
If this is TRUE (default) then the constant term is estimated, otherwise the model includes no constant term. |
B |
If B is greater than 1 bootstrap estimates of the standard error are returned. If B=1, no standard errors are returned. |
ncores |
If ncores is 2 or more, parallel computing is performed. This is to be used for the case of bootstrap. If B=1, this is not taken into consideration. |
xnew |
If you have new data use it, otherwise leave it NULL. |
The ordinary least squares criterion between the observed and the fitted compositional data is adopted as the objective function. This involves numerical optimization since the relationship is non-linear. There is no log-likelihood.
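The objective being minimised can be sketched as below; `ols.obj` and the baseline-logit link are illustrative assumptions, not the package's exact internals:

```r
# Residual sum of squares between observed and fitted compositions,
# with fitted values produced by a multinomial-logit type link
ols.obj <- function(be, y, x) {
  be <- matrix(be, nrow = ncol(x))
  e  <- cbind(1, exp(x %*% be))   # first component taken as the baseline
  mu <- e / rowSums(e)            # fitted compositional values
  sum((y - mu)^2)                 # the OLS criterion; no log-likelihood
}
y <- as.matrix(iris[, 1:3]);  y <- y / rowSums(y)
x <- cbind(1, iris[, 4])          # constant term included (con = TRUE)
fit <- optim(numeric(2 * 2), ols.obj, y = y, x = x)
```

optim is used here for simplicity; any numerical optimiser of this criterion would do.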
A list including:
runtime |
The time required by the regression. |
beta |
The beta coefficients. |
covbe |
The covariance matrix of the beta coefficients. If B=1, this is based on the observed information (Hessian matrix); otherwise, if B > 1, this is the bootstrap estimate. |
est |
The fitted values, or the predictions for xnew if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Murteira J.M.R. and Ramalho J.J.S. (2016). Regression analysis of multivariate fractional data. Econometric Reviews, 35(4): 515-552.
diri.reg, js.compreg, kl.compreg, comp.reg, alfa.reg
library(MASS)
x <- as.vector(fgl[, 1])
y <- as.matrix(fgl[, 2:9])
y <- y / rowSums(y)
mod1 <- ols.compreg(y, x, B = 1, ncores = 1)
mod2 <- js.compreg(y, x, B = 1, ncores = 1)
Non-parametric zero replacement strategies.
zeroreplace(x, a = 0.65, delta = NULL, type = "multiplicative")
x |
A matrix with the compositional data. |
a |
The replacement value ( |
delta |
Unless you specify the replacement value |
type |
This can be any of "multiplicative", "additive" or "simple". See the references for more details. |
The "additive" is the zero replacement strategy suggested in Aitchison (1986, pg. 269). All of the three strategies can be found in Martin-Fernandez et al. (2003).
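The multiplicative strategy, for instance, can be sketched as follows; the helper and its default delta are illustrative (`zeroreplace` derives a replacement value from the data):

```r
# Multiplicative replacement: impute zeros with delta and shrink the
# non-zero parts so that every row still sums to 1
mult.replace <- function(x, delta = 1e-3) {
  z <- x == 0
  x <- x * (1 - delta * rowSums(z))  # shrink each row multiplicatively
  x[z] <- delta                      # then fill in the zeros
  x
}
x <- as.matrix(iris[1:10, 1:4]);  x <- x / rowSums(x)
x[1, 2] <- 0;  x <- x / rowSums(x)
w <- mult.replace(x)
```

Rows without zeros are only rescaled by 1, so they are left untouched.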
A matrix with the zero replaced compositional data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Martin-Fernandez J. A., Barcelo-Vidal C. & Pawlowsky-Glahn, V. (2003). Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology, 35(3): 253-278.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
x <- as.matrix(iris[1:20, 1:4])
x <- x / rowSums(x)
x[ sample(1:20, 4), sample(1:4, 1) ] <- 0
x <- x / rowSums(x)
zeroreplace(x)
Permutation linear independence test in the SCLS model.
scls.indeptest(y, x, R = 999)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. |
x |
A matrix with the compositional predictors. Zero values are allowed. |
R |
The number of permutations to perform. |
A permutation test of independence in the constrained linear least squares regression for compositional responses and predictors is performed. The observed test statistic is the MSE computed by scls. Then, the rows of X are permuted R times, each time the constrained OLS is performed and the MSE is computed. The p-value is then computed in the usual way.
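The permutation machinery has the following generic shape; `stat` is a placeholder for the MSE returned by scls, and the lower tail is used because a small MSE is evidence against independence:

```r
# Permute the rows of x, recompute the statistic each time, and compare
# with the observed value (lower tail, since smaller MSE = stronger signal)
perm.pvalue <- function(y, x, stat, R = 999) {
  tobs <- stat(y, x)
  tb <- numeric(R)
  for (i in 1:R) tb[i] <- stat(y, x[sample(nrow(x)), , drop = FALSE])
  (sum(tb <= tobs) + 1) / (R + 1)
}
```

With a toy statistic such as `function(y, x) mean((y - x)^2)` and y identical to x, the p-value comes out small, as expected.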
The p-value for the test of independence between Y and X.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
scls, scls2, tflr, scls.betest
library(MASS)
set.seed(1234)
y <- rdiri(214, runif(4, 1, 3))
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
scls.indeptest(y, x, R = 99)
Permutation linear independence test in the TFLR model.
tflr.indeptest(y, x, R = 999, ncores = 1)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. |
x |
A matrix with the compositional predictors. Zero values are in general allowed, but there can be cases when these are problematic. |
R |
The number of permutations to perform. |
ncores |
The number of cores to use if you are interested in parallel computations. |
A permutation test of independence in the constrained linear least squares regression for compositional responses and predictors is performed. The observed test statistic is the Kullback-Leibler divergence computed by tflr. Then, the rows of X are permuted R times, each time the TFLR is performed and the Kullback-Leibler divergence is computed. The p-value is then computed in the usual way.
The p-value for the test of linear independence between the simplicial response Y and the simplicial predictor X.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Fiksel J., Zeger S. and Datta A. (2022). A transformation-free linear regression for compositional outcomes and predictors. Biometrics, 78(3): 974–987.
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
library(MASS)
set.seed(1234)
y <- rdiri(214, runif(4, 1, 3))
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
tflr.indeptest(y, x, R = 9)
Permutation test for the matrix of coefficients in the SCLS model.
scls.betest(y, x, B, R = 999)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. |
x |
A matrix with the compositional predictors. Zero values are allowed. |
B |
A specific matrix of coefficients to test. Under the null hypothesis, the matrix of coefficients is equal to this matrix. |
R |
The number of permutations to perform. |
A permutation test for the matrix of coefficients in the constrained linear least squares regression for compositional responses and predictors is performed. The observed test statistic is the MSE computed by scls. Then, the rows of X are permuted R times, each time the constrained OLS is performed and the MSE is computed. The p-value is then computed in the usual way.
The p-value for the test that the matrix of coefficients is equal to the matrix B.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
scls, scls2, tflr, scls.indeptest, tflr.indeptest
y <- rdiri(100, runif(3, 1, 3))
x <- rdiri(100, runif(3, 1, 3))
B <- diag(3)
scls.betest(y, x, B = B, R = 99)
Permutation test for the matrix of coefficients in the TFLR model.
tflr.betest(y, x, B, R = 999, ncores = 1)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. |
x |
A matrix with the compositional predictors. Zero values are in general allowed, but there can be cases when these are problematic. |
B |
A specific matrix of coefficients to test. Under the null hypothesis, the matrix of coefficients is equal to this matrix. |
R |
The number of permutations to perform. |
ncores |
The number of cores to use if you are interested in parallel computations. |
A permutation test for the matrix of coefficients in the constrained linear least squares regression for compositional responses and predictors is performed. The observed test statistic is the Kullback-Leibler divergence computed by tflr. Then, the rows of X are permuted R times, each time the TFLR is performed and the Kullback-Leibler divergence is computed. The p-value is then computed in the usual way.
The p-value for the test that the matrix of coefficients is equal to the matrix B.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Fiksel J., Zeger S. and Datta A. (2022). A transformation-free linear regression for compositional outcomes and predictors. Biometrics, 78(3): 974–987.
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
tflr, tflr.indeptest, scls, scls.indeptest
y <- rdiri(100, runif(3, 1, 3))
x <- rdiri(100, runif(3, 1, 3))
B <- diag(3)
tflr.betest(y, x, B = B, R = 99)
Perturbation operation.
perturbation(x, y, oper = "+")
x |
A matrix with the compositional data. |
y |
Either a matrix with compositional data or a vector with compositional data. In either case, the data need not be compositional, as long as they are non-negative. |
oper |
For the summation this must be "*" and for the negation it must be "/". According to Aitchison (1986), multiplication is equal to summation in the log-space, and division is equal to negation. |
This is the perturbation operation defined by Aitchison (1986).
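In code, the operation is simply an element-wise product (or quotient) followed by re-closure; a minimal sketch with an illustrative helper:

```r
# Perturbation: component-wise multiplication, then re-closure to sum to 1;
# oper = "/" gives the inverse operation (negation in log-space)
perturb <- function(x, y, oper = "*") {
  z <- if (oper == "*") x * y else x / y
  z / rowSums(z)
}
x <- as.matrix(iris[1:5, 1:4]);   x <- x / rowSums(x)
y <- as.matrix(iris[6:10, 1:4]);  y <- y / rowSums(y)
w <- perturb(x, y)
```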
A matrix with the perturbed compositional data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
x <- as.matrix(iris[1:15, 1:4])
y <- as.matrix(iris[21:35, 1:4])
perturbation(x, y)
perturbation(x, y[1, ])
Plot of the LASSO coefficients.
lassocoef.plot(lasso, lambda = TRUE)
lasso |
An object where you have saved the result of the LASSO regression. See the examples for more details. |
lambda |
If you want the x-axis to contain the logarithm of the penalty parameter |
This function plots the L2-norm of the coefficients of each predictor variable versus the logarithm of the penalty parameter λ or the L1-norm of the coefficients. This is the same plot as the one produced by the glmnet package with type.coef = "2norm".
A plot of the L2-norm of the coefficients of each predictor variable (y-axis) versus the L1-norm of all the coefficients (x-axis).
Michail Tsagris and Abdulaziz Alenazi.
R implementation and documentation: Michail Tsagris [email protected] and Abdulaziz Alenazi [email protected].
Alenazi, A. A. (2022). f-divergence regression models for compositional data. Pakistan Journal of Statistics and Operation Research, 18(4): 867–882.
Friedman, J., Hastie, T. and Tibshirani, R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Vol. 33(1), 1–22.
lasso.klcompreg, cv.lasso.klcompreg, lasso.compreg, cv.lasso.compreg,
kl.compreg, comp.reg
y <- as.matrix(iris[, 1:4])
y <- y / rowSums(y)
x <- matrix( rnorm(150 * 30), ncol = 30 )
a <- lasso.klcompreg(y, x)
lassocoef.plot(a)
b <- lasso.compreg(y, x)
lassocoef.plot(b)
Power operation.
pow(x, a)
x |
A matrix with the compositional data. |
a |
Either a vector with numbers or a single number. |
This is the power operation defined by Aitchison (1986). It is also the starting point of the α-transformation.
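For a single value of a, the operation amounts to two lines; a sketch with an illustrative helper:

```r
# Power operation: raise each part to the power a and re-close the rows
powop <- function(x, a) {
  z <- x^a
  z / rowSums(z)
}
x <- as.matrix(iris[1:5, 1:4]);  x <- x / rowSums(x)
w <- powop(x, 0.5)
```

Note that a = 1 returns the composition unchanged.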
A matrix with the power transformed compositional data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. http://arxiv.org/pdf/1106.1451.pdf
x <- as.matrix(iris[1:15, 1:4])
a <- runif(1)
pow(x, a)
Principal component analysis.
logpca(x, center = TRUE, scale = TRUE, k = NULL, vectors = FALSE)
x |
A matrix with the compositional data. Zero values are not allowed. |
center |
Do you want your data centered? TRUE or FALSE. |
scale |
Do you want each of your variables scaled, i.e. to have unit variance? TRUE or FALSE. |
k |
If you want a specific number of eigenvalues and eigenvectors set it here, otherwise all eigenvalues (and eigenvectors if requested) will be returned. |
vectors |
Do you want the eigenvectors to be returned? By default this is FALSE. |
The logarithm is applied to the compositional data and PCA is performed.
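The two steps in base R (prcomp is used here for illustration; the eigenvalues are the squared singular values it reports):

```r
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)                       # compositional data, no zeros
pc <- prcomp(log(x), center = TRUE, scale. = TRUE)
values  <- pc$sdev^2                      # the eigenvalues
vectors <- pc$rotation                    # the eigenvectors (loadings)
```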
A list including:
values |
The eigenvalues. |
vectors |
The eigenvectors. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
alfa.pca, alfa.pcr, kl.alfapcr
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
a <- logpca(x)
α-transformation
Principal component analysis using the α-transformation.
alfa.pca(x, a, center = TRUE, scale = TRUE, k = NULL, vectors = FALSE)
x |
A matrix with the compositional data. Zero values are allowed. In that case "a" should be positive. |
a |
The value of |
center |
Do you want your data centered? TRUE or FALSE. |
scale |
Do you want each of your variables scaled, i.e. to have unit variance? TRUE or FALSE. |
k |
If you want a specific number of eigenvalues and eigenvectors set it here, otherwise all eigenvalues (and eigenvectors if requested) will be returned. |
vectors |
Do you want the eigenvectors to be returned? By default this is FALSE. |
The α-transformation is applied to the compositional data and then PCA is performed. Note, however, that the right multiplication by the Helmert sub-matrix is not applied, in order to be in accordance with Aitchison (1983). When α = 0, this results in the PCA proposed by Aitchison (1983).
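A sketch of the transformation without the Helmert sub-matrix, followed by ordinary PCA; the `alfa` helper below is illustrative, not the package's own function:

```r
# alpha-transformation kept in R^D (no Helmert multiplication);
# alpha = 0 reduces to the centred log-ratio transformation
alfa <- function(x, a) {
  if (a == 0) return(log(x) - rowMeans(log(x)))  # clr limit
  z <- x^a
  z <- z / rowSums(z)       # closed power transformation
  (ncol(x) * z - 1) / a
}
x <- as.matrix(iris[, 1:4]);  x <- x / rowSums(x)
pc <- prcomp(alfa(x, 0.5))
```

The rows of the transformed data sum to zero for every α, including the clr limit.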
A list including:
values |
The eigenvalues. |
vectors |
The eigenvectors. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Aitchison, J. (1983). Principal component analysis of compositional data. Biometrika, 70(1), 57-65.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. http://arxiv.org/pdf/1106.1451.pdf
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
a <- alfa.pca(x, 0.5)
Principal component generalised linear models.
glm.pcr(y, x, k = 1, xnew = NULL)
y |
A numerical vector with 0 and 1 (binary) or a vector with discrete (count) data. |
x |
A matrix with the predictor variable(s), they have to be continuous. |
k |
A number greater than or equal to 1. How many principal components to use. You may get results for the sequence of principal components. |
xnew |
If you have new data use it, otherwise leave it NULL. |
Principal component regression is performed with binary logistic or Poisson regression, depending on the nature of the response variable. The principal components of the cross product of the independent variables are obtained and classical regression is performed. This is used in the function alfa.pcr.
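The construction can be sketched with base R's prcomp and glm (illustrative, not the package's Rfast-based implementation):

```r
x <- as.matrix(iris[, 1:4])
set.seed(1)
y <- rbinom(150, 1, 0.6)                             # binary response
sc <- prcomp(x, center = TRUE)$x[, 1, drop = FALSE]  # first k = 1 scores
mod <- glm(y ~ sc, family = binomial)                # logistic regression on the scores
```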
A list including:
model |
The summary of the logistic or Poisson regression model as returned by the package Rfast. |
per |
The percentage of variance of the predictor variables retained by the k principal components. |
vec |
The principal components, the loadings. |
est |
The fitted or the predicted values (if xnew is not NULL). |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aguilera A.M., Escabias M. and Valderrama M.J. (2006). Using principal components for estimating logistic regression with high-dimensional multicollinear data. Computational Statistics & Data Analysis 50(8): 1905-1924.
Jolliffe I.T. (2002). Principal Component Analysis.
x <- as.matrix(iris[, 1:4])
y <- rbinom(150, 1, 0.6)
mod <- glm.pcr(y, x, k = 1)
α-distance
Principal coordinate analysis using the α-distance.
alfa.mds(x, a, k = 2, eig = TRUE)
x |
A matrix with the compositional data. Zero values are allowed. |
a |
The value of a. In case of zero values in the data it has to be greater than 1. |
k |
The maximum dimension of the space in which the data are to be represented. This can be a number between 1 and |
eig |
Should eigenvalues be returned? The default value is TRUE. |
The function computes the α-distance matrix and then plugs it into "cmdscale", the classical multidimensional scaling function.
A list with the results of the "cmdscale" function.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Cox, T. F. and Cox, M. A. A. (2001). Multidimensional Scaling. Second edition. Chapman and Hall.
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Chapter 14 of Multivariate Analysis, London: Academic Press.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
a <- alfa.mds(x, a = 0.5)
Principal coordinate analysis using the Jensen-Shannon divergence.
esov.mds(x, k = 2, eig = TRUE)
x |
A matrix with the compositional data. Zero values are allowed. |
k |
The maximum dimension of the space in which the data are to be represented. This can be a number between 1 and |
eig |
Should eigenvalues be returned? The default value is TRUE. |
The function computes the Jensen-Shannon divergence matrix and then plugs it into "cmdscale", the classical multidimensional scaling function.
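The divergence matrix itself is easy to sketch; the pairwise loop below is illustrative and uses the 0 log 0 = 0 convention so that zeros are allowed:

```r
# Jensen-Shannon divergence between two compositions
js <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(w) sum(ifelse(w > 0, w * log(w / m), 0))
  0.5 * kl(p) + 0.5 * kl(q)
}
x <- as.matrix(iris[1:30, 1:4]);  x <- x / rowSums(x)
n <- nrow(x)
d <- matrix(0, n, n)
for (i in 1:(n - 1))  for (j in (i + 1):n)
  d[i, j] <- d[j, i] <- js(x[i, ], x[j, ])
a <- cmdscale(d, k = 2, eig = TRUE)
```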
A list with the results of the "cmdscale" function.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Cox, T. F. and Cox, M. A. A. (2001). Multidimensional Scaling. Second edition. Chapman and Hall.
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Chapter 14 of Multivariate Analysis, London: Academic Press.
Tsagris M. (2015). A novel, divergence based, regression for compositional data. Proceedings of the 28th Panhellenic Statistics Conference, 15-18 April 2015, Athens, Greece. https://arxiv.org/pdf/1511.07600.pdf
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
a <- esov.mds(x)
Projection pursuit regression for compositional data.
comp.ppr(y, x, nterms = 3, type = "alr", xnew = NULL, yb = NULL )
y |
A matrix with the compositional data. |
x |
A matrix with the continuous predictor variables or a data frame including categorical predictor variables. |
nterms |
The number of terms to include in the final model. |
type |
Either "alr" or "ilr" corresponding to the additive or the isometric log-ratio transformation respectively. |
xnew |
If you have new data use it, otherwise leave it NULL. |
yb |
If you have already transformed the data using a log-ratio transformation put it here. Otherwise leave it NULL. |
This is the standard projection pursuit. See the built-in function "ppr" for more details.
A list including:
runtime |
The runtime of the regression. |
mod |
The produced model as returned by the function "ppr". |
est |
The fitted values of xnew if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76, 817-823. doi: 10.2307/2287576.
compppr.tune, aknn.reg, akern.reg, comp.reg, kl.compreg, alfa.reg
y <- as.matrix(iris[, 1:3])
y <- y / rowSums(y)
x <- iris[, 4]
mod <- comp.ppr(y, x)
Projection pursuit regression with compositional predictor variables.
pprcomp(y, x, nterms = 3, type = "log", xnew = NULL)
y |
A numerical vector with the continuous variable. |
x |
A matrix with the compositional data. No zero values are allowed. |
nterms |
The number of terms to include in the final model. |
type |
Either "alr" or "log" corresponding to the additive log-ratio transformation or the simple logarithm applied to the compositional data. |
xnew |
If you have new data use it, otherwise leave it NULL. |
This is the standard projection pursuit. See the built-in function "ppr" for more details. When the data are transformed with the additive log-ratio transformation this is close in spirit to the log-contrast regression.
A list including:
runtime |
The runtime of the regression. |
mod |
The produced model as returned by the function "ppr". |
est |
The fitted values of xnew if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76, 817-823. doi: 10.2307/2287576.
pprcomp.tune, ice.pprcomp, alfa.pcr, lc.reg, comp.ppr
x <- as.matrix( iris[, 2:4] )
x <- x / rowSums(x)
y <- iris[, 1]
pprcomp(y, x)
α-transformation
Projection pursuit regression with compositional predictor variables using the α-transformation.
alfa.pprcomp(y, x, nterms = 3, a, xnew = NULL)
y |
A numerical vector with the continuous variable. |
x |
A matrix with the compositional data. Zero values are allowed. |
nterms |
The number of terms to include in the final model. |
a |
The value of |
xnew |
If you have new data use it, otherwise leave it NULL. |
This is the standard projection pursuit. See the built-in function "ppr" for more details. The compositional data are transformed with the α-transformation.
A list including:
runtime |
The runtime of the regression. |
mod |
The produced model as returned by the function "ppr". |
est |
The fitted values of xnew if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76, 817-823. doi: 10.2307/2287576.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
alfapprcomp.tune, pprcomp, comp.ppr
x <- as.matrix( iris[, 2:4] )
x <- x / rowSums(x)
y <- iris[, 1]
alfa.pprcomp(y, x, a = 0.5)
Projections based test for distributional equality of two groups.
dptest(x1, x2, B = 100)
x1 |
A matrix containing compositional data of the first group. |
x2 |
A matrix containing compositional data of the second group. |
B |
The number of random uniform projections to use. |
The test compares the distributions of two compositional datasets using random projections. For more details see Cuesta-Albertos, Cuevas and Fraiman (2009).
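The idea can be sketched in a few lines: draw random directions, project both samples on each, and run a univariate Kolmogorov-Smirnov test per projection (combining the p-values is then a separate step):

```r
set.seed(42)
x1 <- matrix(rbeta(150, 2, 3), ncol = 3);  x1 <- x1 / rowSums(x1)
x2 <- matrix(rbeta(150, 2, 3), ncol = 3);  x2 <- x2 / rowSums(x2)
B <- 20
pvals <- numeric(B)
for (i in 1:B) {
  u <- runif(3);  u <- u / sqrt(sum(u^2))   # random projection direction
  pvals[i] <- ks.test(x1 %*% u, x2 %*% u)$p.value
}
```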
A vector including:
pvalues |
The p-values of the Kolmogorov-Smirnov tests. |
pvalue |
The p-value of the test based on the Benjamini and Heller (2008) procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Cuesta-Albertos J. A., Cuevas A. and Fraiman, R. (2009). On projection-based tests for directional and compositional data. Statistics and Computing, 19: 367–380.
Benjamini Y. and Heller R. (2008). Screening for partial conjunction hypotheses. Biometrics, 64(4): 1215–1222.
x1 <- rdiri(50, c(3, 4, 5))  ## two samples from the same Dirichlet distribution
x2 <- rdiri(50, c(3, 4, 5))
dptest(x1, x2)
Proportionality correlation coefficient matrix.
pcc(x)
x |
A numerical matrix with the compositional data. Zeros are not allowed as the logarithm is applied. |
The function returns the proportionality correlation coefficient matrix. See Lovell et al. (2015) for more information.
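For a pair of parts, one common form of the proportionality coefficient in the spirit of Lovell et al. (2015) is 2 cov(log x, log y) / (var(log x) + var(log y)); the exact definition used by pcc() may differ, so treat this NumPy sketch (Python, not the package's R code) as an illustration only.

```python
# Pairwise proportionality coefficient on the log scale.
import numpy as np

def prop_coef(x, y):
    lx, ly = np.log(x), np.log(y)
    return 2 * np.cov(lx, ly)[0, 1] / (np.var(lx, ddof=1) + np.var(ly, ddof=1))

rng = np.random.default_rng(1)
x = rng.gamma(5, size=200)
y = 3 * x                      # exactly proportional parts
z = rng.gamma(5, size=200)     # unrelated part
```

Exactly proportional parts give a coefficient of 1, since log y = log 3 + log x.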
The proportionality correlation coefficient matrix of the compositional data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Zheng, B. (2000). Summarizing the goodness of fit of generalized linear models for longitudinal data. Statistics in medicine, 19(10), 1265-1275.
Lovell D., Pawlowsky-Glahn V., Egozcue J. J., Marguerat S. and Bahler, J. (2015). Proportionality: a valid alternative to correlation for relative data. PLoS Computational Biology, 11(3), e1004075.
x <- Compositional::rdiri(100, runif(4) )
a <- Compositional::pcc(x)
Quasi binomial regression for proportions.
propreg(y, x, varb = "quasi", tol = 1e-07, maxiters = 100)
propregs(y, x, varb = "quasi", tol = 1e-07, logged = FALSE, maxiters = 100)
y |
A numerical vector with proportions. 0s and 1s are allowed. |
x |
For the "propreg" a matrix with data, the predictor variables. This can be a matrix or a data frame. For the "propregs" this must be a numerical matrix, where each column denotes a variable. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. By default this is set to 1e-07. |
varb |
The type of estimate to be used in order to estimate the covariance matrix of the regression coefficients. There are two options, either "quasi" (default value) or "glm". See the references for more information. |
logged |
Should the p-values be returned (FALSE) or their logarithm (TRUE)? |
maxiters |
The maximum number of iterations before the Newton-Raphson is terminated automatically. |
We use the Newton-Raphson algorithm but, unlike R's built-in function "glm", we perform no checks and no extra calculations; we simply fit the model. The "propregs" function is intended for very many univariate regressions: in this case "x" is a matrix and the significance of each variable (column of the matrix) is tested. The function accepts binary responses as well (0 or 1).
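The core Newton-Raphson (Fisher scoring) step for a fractional-logit model of this kind can be sketched as follows. This NumPy illustration (Python, not the package's R code) uses the logit link with binomial-type variance for y in [0, 1], as in Papke and Wooldridge (1996); propreg() additionally computes the quasi-likelihood or GLM covariance estimate.

```python
# Fisher scoring (IRLS) for a fractional-logit regression.
import numpy as np

def frac_logit(y, X, tol=1e-7, maxiters=100):
    X = np.column_stack([np.ones(len(y)), X])   # add an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(maxiters):
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))              # fitted proportions
        w = mu * (1 - mu)                        # working weights
        z = eta + (y - mu) / w                   # working response
        new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(new - beta)) < tol:
            return new
        beta = new
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = rng.beta(2, 5, size=300)                     # proportions in (0, 1)
beta = frac_logit(y, X)
```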
For the "propreg" function a list including:
iters |
The number of iterations required by the Newton-Raphson. |
varb |
The covariance matrix of the regression coefficients. |
phi |
The phi parameter is returned if the input argument "varb" was set to "glm", otherwise this is NULL. |
info |
A table similar to the one produced by "glm" with the estimated regression coefficients, their standard error, Wald test statistic and p-values. |
For the "propregs" a two-column matrix with the test statistics (Wald statistic) and the associated p-values (or their logarithm).
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Papke L. E. & Wooldridge J. (1996). Econometric methods for fractional response variables with an application to 401(K) plan participation rates. Journal of Applied Econometrics, 11(6): 619–632.
McCullagh, Peter, and John A. Nelder. Generalized linear models. CRC press, USA, 2nd edition, 1989.
y <- rbeta(100, 1, 4)
x <- matrix(rnorm(100 * 3), ncol = 3)
a <- propreg(y, x)
y <- rbeta(100, 1, 4)
x <- matrix(rnorm(400 * 100), ncol = 400)
b <- propregs(y, x)
mean(b[, 2] < 0.05)
(0, 1) interval
Random values generation from some univariate distributions defined on the (0, 1) interval.
rbeta1(n, a)
runitweibull(n, a, b)
rlogitnorm(n, m, s, fast = FALSE)
n |
The sample size, a numerical value. |
a |
The shape parameter of the Be(a, 1) distribution. In the case of the unit Weibull, this is its shape parameter. |
b |
This is the scale parameter for the unit Weibull distribution. |
m |
The mean of the univariate normal on the real line (for the logit-normal). |
s |
The standard deviation of the univariate normal on the real line (for the logit-normal). |
fast |
If you want a faster generation set this equal to TRUE. This will use the Rnorm() function from the Rfast package. However, the speed is only observable if you want to simulate at least 500 (this number may vary among computers) observations. The larger the sample size the higher the speed-up. |
The function generates random values from the Be(a, 1), the unit Weibull or the univariate logistic normal distribution.
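Two of these generators follow from standard constructions, sketched here in NumPy (Python, not the package's R code): Be(a, 1) has CDF x^a on (0, 1), so the inverse-CDF method gives U^(1/a), and the logit-normal is the inverse logit of a N(m, s^2) draw. The unit Weibull is omitted because its parameterisations vary across references.

```python
# Inverse-CDF and transformation sketches for (0, 1)-valued distributions.
import numpy as np

rng = np.random.default_rng(3)

def rbeta1(n, a):
    return rng.uniform(size=n) ** (1 / a)        # inverse CDF of Be(a, 1)

def rlogitnorm(n, m, s):
    z = rng.normal(m, s, size=n)
    return 1 / (1 + np.exp(-z))                  # inverse logit maps R to (0, 1)

x = rbeta1(10_000, 3)                            # mean should be near a/(a+1) = 0.75
w = rlogitnorm(10_000, 0, 1)
```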
A vector with the simulated data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- rbeta1(100, 3)
Read a file as a Filebacked Big Matrix.
read.fbm(file, select)
file |
The File to read. |
select |
Indices of columns to read (sorted). The length of select will be the number of columns of the resulting FBM. |
The function reads a file as a Filebacked Big Matrix object. For more information see the "bigstatsr" package.
A Filebacked Big Matrix object.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- matrix( runif(50 * 20, 0, 2*pi), ncol = 20 )
alpha-transformation
Regression with compositional data using the alpha-transformation.
alfa.reg(y, x, a, xnew = NULL, yb = NULL)
alfa.reg2(y, x, a, xnew = NULL)
alfa.reg3(y, x, a = c(-1, 1), xnew = NULL)
y |
A matrix with the compositional data. |
x |
A matrix with the continuous predictor variables or a data frame including categorical predictor variables. |
a |
The value of the power transformation, it has to be between -1 and 1. If zero values are present it has to be greater than 0. If a = 0, the isometric log-ratio transformation is applied. |
xnew |
If you have new data use it, otherwise leave it NULL. |
yb |
If you have already transformed the data using the alpha-transformation, put the transformed data here; otherwise leave it NULL. This is intended to be used by the function alfareg.tune. |
The alpha-transformation is applied to the compositional data first and then multivariate regression is performed. This involves numerical optimisation. The alfa.reg2() function accepts a vector with many values of alpha, while the alfa.reg3() function searches for the value of alpha that minimises the Kullback-Leibler divergence between the observed and the fitted compositional values. The functions are highly optimised.
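The alpha-transformation of Tsagris, Preston and Wood (2011) that these functions apply can be sketched as follows in NumPy (Python, not the package's R code): the power-transformed composition u_j = x_j^a / sum_k x_k^a is mapped to z_j = (D u_j - 1) / a, and as a tends to 0 this recovers the centred log-ratio transformation. The package additionally multiplies by a Helmert sub-matrix to land in R^(D-1); that step is omitted here.

```python
# The alpha-transformation (without the final Helmert rotation).
import numpy as np

def alpha_transform(x, a):
    x = np.asarray(x, dtype=float)
    D = x.shape[-1]
    u = x**a / (x**a).sum(axis=-1, keepdims=True)   # power transformation
    return (D * u - 1) / a                          # rows sum to zero

x = np.array([[0.2, 0.3, 0.5]])
z_half = alpha_transform(x, 0.5)
clr = np.log(x) - np.log(x).mean(axis=-1, keepdims=True)   # a -> 0 limit
z_tiny = alpha_transform(x, 1e-6)                          # approximates clr
```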
For the alfa.reg() function a list including:
runtime |
The time required by the regression. |
be |
The beta coefficients. |
seb |
The standard error of the beta coefficients. |
est |
The fitted values for xnew if xnew is not NULL. |
For the alfa.reg2() function a list with as many sublists as the number of values of alpha. Each element (sublist) of the list contains the above outcomes of the alfa.reg() function.
For the alfa.reg3() function a list with all previous elements plus an output "alfa", the optimal value of alpha.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2015). Regression analysis with compositional data containing zero values. Chilean Journal of Statistics, 6(2): 47-57. https://arxiv.org/pdf/1508.01913v1.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
Mardia K.V., Kent J.T., and Bibby J.M. (1979). Multivariate analysis. Academic press.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
alfareg.tune, diri.reg, js.compreg, kl.compreg,
ols.compreg, comp.reg
library(MASS)
x <- as.vector(fgl[1:40, 1])
y <- as.matrix(fgl[1:40, 2:9])
y <- y / rowSums(y)
mod <- alfa.reg(y, x, 0.2)
alpha-transformation
Regularised and flexible discriminant analysis for compositional data using the alpha-transformation.
alfa.rda(xnew, x, ina, a, gam = 1, del = 0)
alfa.fda(xnew, x, ina, a)
xnew |
A matrix with the new compositional data whose group is to be predicted. Zeros are allowed, but you must be careful to choose strictly positive values of alpha. |
x |
A matrix with the available compositional data. Zeros are allowed, but you must be careful to choose strictly positive values of alpha. |
ina |
A group indicator variable for the available data. |
a |
The value of the power transformation alpha. |
gam |
This is a number between 0 and 1. It is the weight of the pooled covariance and the diagonal matrix. |
del |
This is a number between 0 and 1. It is the weight of the LDA and QDA. |
For the alfa.rda, the covariance matrix of each group is calculated and then the pooled covariance matrix. The spherical covariance matrix consists of the average of the pooled variances in its diagonal and zeros in the off-diagonal elements. gam is the weight of the pooled covariance matrix and 1-gam is the weight of the spherical covariance matrix, Sa = gam * Sp + (1-gam) * sp. The result is a compromise between LDA and QDA. del is the weight of Sa and 1-del the weight of each group covariance matrix.
For the alfa.fda a flexible discriminant analysis is performed. See the R package fda for more details.
For the alfa.rda a list including:
prob |
The estimated probabilities of the new data of belonging to each group. |
scores |
The estimated scores of the new data of each group. |
est |
The estimated group membership of the new data. |
For the alfa.fda a list including:
mod |
An fda object as returned by the command fda of the R package mda. |
est |
The estimated group membership of the new data. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Friedman Jerome, Trevor Hastie and Robert Tibshirani (2009). The elements of statistical learning, 2nd edition. Springer, Berlin.
Tsagris Michail, Simon Preston and Andrew T.A. Wood (2016). Improved classification for compositional data using the alpha-transformation. Journal of Classification, 33(2): 243-261.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
Hastie, Tibshirani and Buja (1994). Flexible Discriminant Analysis by Optimal Scoring. Journal of the American Statistical Association, 89(428): 1255-1270.
alfa, alfarda.tune, alfa.knn, alfa.nb, comp.nb, mix.compnorm
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
ina <- iris[, 5]
mod <- alfa.rda(x, x, ina, 0)
table(ina, mod$est)
mod2 <- alfa.fda(x, x, ina, 0)
table(ina, mod2$est)
Regularised discriminant analysis for Euclidean data.
rda(xnew, x, ina, gam = 1, del = 0)
xnew |
A matrix with the new data whose group is to be predicted. They have to be continuous. |
x |
A matrix with the available data. They have to be continuous. |
ina |
A group indicator variable for the available data. |
gam |
This is a number between 0 and 1. It is the weight of the pooled covariance and the diagonal matrix. |
del |
This is a number between 0 and 1. It is the weight of the LDA and QDA. |
The covariance matrix of each group is calculated and then the pooled covariance matrix. The spherical covariance matrix consists of the average of the pooled variances in its diagonal and zeros in the off-diagonal elements. gam is the weight of the pooled covariance matrix and 1-gam is the weight of the spherical covariance matrix, Sa = gam * Sp + (1-gam) * sp. The result is a compromise between LDA and QDA. del is the weight of Sa and 1-del the weight of each group covariance matrix.
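The covariance blending described above can be sketched in a few lines. This NumPy illustration (Python, not the package's R code) builds Sa = gam * Sp + (1 - gam) * sp * I from two simulated groups and then weights it against each group covariance with del.

```python
# Regularised covariance matrices in the style of rda().
import numpy as np

rng = np.random.default_rng(4)
groups = [rng.normal(size=(40, 3)), rng.normal(1, 2, size=(40, 3))]

covs = [np.cov(g, rowvar=False) for g in groups]
ns = [len(g) for g in groups]
Sp = sum((n - 1) * S for n, S in zip(ns, covs)) / (sum(ns) - len(ns))  # pooled
sp = np.trace(Sp) / Sp.shape[0]                  # average pooled variance

gam, del_ = 0.5, 0.5
Sa = gam * Sp + (1 - gam) * sp * np.eye(3)       # pooled vs spherical blend
# del_ weights Sa, 1 - del_ weights each group covariance matrix:
S_groups = [del_ * Sa + (1 - del_) * S for S in covs]
```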
A list including:
prob |
The estimated probabilities of the new data of belonging to each group. |
scores |
The estimated scores of the new data of each group. |
est |
The estimated group membership of the new data. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman J.H. (1989): Regularized Discriminant Analysis. Journal of the American Statistical Association 84(405): 165–175.
Friedman Jerome, Trevor Hastie and Robert Tibshirani (2009). The elements of statistical learning, 2nd edition. Springer, Berlin.
Tsagris M., Preston S. and Wood A.T.A. (2016). Improved classification for compositional data using the alpha-transformation. Journal of Classification, 33(2): 243–261.
x <- as.matrix(iris[, 1:4])
ina <- iris[, 5]
mod <- rda(x, x, ina)
table(ina, mod$est)
Ridge regression.
ridge.reg(y, x, lambda, B = 1, xnew = NULL)
y |
A real valued vector. If it contains percentages, the logit transformation is applied. |
x |
A matrix with the predictor variable(s), they have to be continuous. |
lambda |
The value of the regularisation parameter lambda. |
B |
If B = 1 (default value) no bootstrap is performed. Otherwise bootstrap standard errors are returned. |
xnew |
If you have new data whose response value you want to predict put it here, otherwise leave it as is. |
This function is used by alfa.ridge. There is also a built-in function available from the MASS library, called "lm.ridge".
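The ridge estimator itself has a closed form, beta(lambda) = (X'X + lambda I)^(-1) X'y. The NumPy sketch below (Python, not the package's R code) centres the data and handles the intercept separately; the exact standardisation inside ridge.reg() is an assumption here.

```python
# Closed-form ridge regression with a separately handled intercept.
import numpy as np

def ridge(y, X, lam):
    Xc = X - X.mean(axis=0)                    # centre; intercept = mean(y)
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    return y.mean(), beta

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

b0 = ridge(y, X, 0.0)[1]                       # lambda = 0 gives the OLS fit
b1 = ridge(y, X, 10.0)[1]                      # larger lambda shrinks towards 0
```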
A list including:
beta |
The beta coefficients. |
seb |
The standard error of the coefficients. If B > 1 the bootstrap standard errors will be returned. |
est |
The fitted or the predicted values (if xnew is not NULL). |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Hoerl A.E. and R.W. Kennard (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1): 55-67.
Brown P. J. (1994). Measurement, Regression and Calibration. Oxford Science Publications.
ridge.tune, alfa.ridge, ridge.plot
y <- as.vector(iris[, 1])
x <- as.matrix(iris[, 2:4])
mod1 <- ridge.reg(y, x, lambda = 0.1)
mod2 <- ridge.reg(y, x, lambda = 0)
A plot of the regularised regression coefficients is shown.
ridge.plot(y, x, lambda = seq(0, 5, by = 0.1) )
y |
A numeric vector containing the values of the target variable. If the values are proportions or percentages, i.e. strictly within 0 and 1 they are mapped into R using the logit transformation. In any case, they must be continuous only. |
x |
A numeric matrix containing the continuous variables. Rows are samples and columns are features. |
lambda |
A grid of values of the regularisation parameter lambda. |
For every value of lambda the coefficients are obtained and they are plotted versus the lambda values.
A plot with the values of the coefficients as a function of lambda.
Michail Tsagris.
R implementation and documentation: Giorgos Athineou <[email protected]> and Michail Tsagris [email protected].
Hoerl A.E. and R.W. Kennard (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1): 55-67.
Brown P. J. (1994). Measurement, Regression and Calibration. Oxford Science Publications.
ridge.reg, ridge.tune, alfa.ridge, alfaridge.plot
y <- as.vector(iris[, 1])
x <- as.matrix(iris[, 2:4])
ridge.plot(y, x, lambda = seq(0, 2, by = 0.1) )
alpha-transformation
Ridge regression with compositional data in the covariates side using the alpha-transformation.
alfa.ridge(y, x, a, lambda, B = 1, xnew = NULL)
y |
A numerical vector containing the response variable values. If they are percentages, they are mapped onto the real line via the logit transformation. |
x |
A matrix with the predictor variables, the compositional data. Zero values are allowed, but you must be careful to choose strictly positive values of alpha. |
a |
The value of the power transformation, it has to be between -1 and 1. If zero values are present it has to be greater than 0. If a = 0, the isometric log-ratio transformation is applied. |
lambda |
The value of the regularisation parameter, lambda. |
B |
If B > 1 bootstrap estimation of the standard errors is implemented. |
xnew |
A matrix containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
The alpha-transformation is applied to the compositional data first and then ridge regression is performed.
The output of the ridge.reg.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Tsagris M. (2015). Regression analysis with compositional data containing zero values. Chilean Journal of Statistics, 6(2): 47-57. https://arxiv.org/pdf/1508.01913v1.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
ridge.reg, alfaridge.tune, alfaridge.plot
library(MASS)
y <- as.vector(fgl[, 1])
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
mod1 <- alfa.ridge(y, x, a = 0.5, lambda = 0.1, B = 1, xnew = NULL)
mod2 <- alfa.ridge(y, x, a = 0.5, lambda = 1, B = 1, xnew = NULL)
A plot of the regularised regression coefficients is shown.
alfaridge.plot(y, x, a, lambda = seq(0, 5, by = 0.1) )
y |
A numeric vector containing the values of the target variable. If the values are proportions or percentages, i.e. strictly within 0 and 1 they are mapped into R using the logit transformation. In any case, they must be continuous only. |
x |
A numeric matrix containing the continuous variables. |
a |
The value of the power transformation alpha. |
lambda |
A grid of values of the regularisation parameter lambda. |
For every value of lambda the coefficients are obtained and they are plotted versus the lambda values.
A plot with the values of the coefficients as a function of lambda.
Michail Tsagris.
R implementation and documentation: Giorgos Athineou <[email protected]> and Michail Tsagris [email protected].
Hoerl A.E. and R.W. Kennard (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1): 55-67.
Brown P. J. (1994). Measurement, Regression and Calibration. Oxford Science Publications.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
library(MASS)
y <- as.vector(fgl[, 1])
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
alfaridge.plot(y, x, a = 0.5, lambda = seq(0, 5, by = 0.1) )
Simplicial constrained median regression for compositional responses and predictors.
scrq(y, x, xnew = NULL)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. |
x |
A matrix with the compositional predictors. Zero values are allowed. |
xnew |
If you have new data use it, otherwise leave it NULL. |
The function performs median regression where the beta coefficients are constrained to be positive and to sum to 1.
A list including:
mlad |
The mean absolute deviation. |
be |
The beta coefficients. |
est |
The fitted values for xnew if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris. M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
library(MASS)
set.seed(1234)
y <- rdiri(214, runif(4, 1, 3))
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
mod <- scrq(y, x)
mod
Simulation of compositional data from Gaussian mixture models.
rmixcomp(n, prob, mu, sigma, type = "alr")
n |
The sample size. |
prob |
A vector with mixing probabilities. Its length is equal to the number of clusters. |
mu |
A matrix where each row corresponds to the mean vector of each cluster. |
sigma |
An array consisting of the covariance matrix of each cluster. |
type |
Should the additive ("alr") or the isometric ("ilr") log-ratio transformation be used? The default value is "alr", the additive log-ratio transformation. |
A sample from a multivariate Gaussian mixture model is generated.
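The generation scheme can be sketched as follows: draw a component label, draw from that component's multivariate normal, then back-transform with the inverse additive log-ratio, x = (exp(y), 1) / (1 + sum(exp(y))). This NumPy sketch (Python, not the package's R code) uses the last part as the common divisor; which part plays that role is a convention.

```python
# Gaussian mixture on the alr scale, mapped back to the simplex.
import numpy as np

rng = np.random.default_rng(6)

def rmixcomp_alr(n, prob, mu, sigma):
    k, d = mu.shape
    ids = rng.choice(k, size=n, p=prob)          # component labels
    y = np.stack([rng.multivariate_normal(mu[i], sigma[i]) for i in ids])
    e = np.exp(y)
    x = np.column_stack([e, np.ones(n)])         # inverse alr (last part = divisor)
    return ids, x / x.sum(axis=1, keepdims=True)

prob = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [2.0, -1.0]])
sigma = np.stack([np.eye(2), 0.5 * np.eye(2)])
ids, x = rmixcomp_alr(200, prob, mu, sigma)      # compositions with d + 1 = 3 parts
```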
A list including:
id |
A numeric variable indicating the cluster of each simulated vector. |
x |
A matrix containing the simulated compositional data. The number of dimensions will be d + 1, where d is the number of columns of mu. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ryan P. Browne, Aisha ElSherbiny and Paul D. McNicholas (2015). R package mixture: Mixture Models for Clustering and Classification.
p <- c(1/3, 1/3, 1/3)
mu <- matrix(nrow = 3, ncol = 4)
s <- array( dim = c(4, 4, 3) )
x <- as.matrix(iris[, 1:4])
ina <- as.numeric(iris[, 5])
mu <- rowsum(x, ina) / 50
s[, , 1] <- cov(x[ina == 1, ])
s[, , 2] <- cov(x[ina == 2, ])
s[, , 3] <- cov(x[ina == 3, ])
y <- rmixcomp(100, p, mu, s, type = "alr")
Simulation of compositional data from mixtures of Dirichlet distributions.
rmixdiri(n, a, prob)
n |
The sample size. |
a |
A matrix where each row contains the parameters of each Dirichlet component. |
prob |
A vector with the mixing probabilities. |
A sample from a Dirichlet mixture model is generated.
A list including:
id |
A numeric variable indicating the cluster of each simulated vector. |
x |
A matrix containing the simulated compositional data. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Ye X., Yu Y. K. and Altschul S. F. (2011). On the inference of Dirichlet mixture priors for protein sequence comparison. Journal of Computational Biology, 18(8), 941-954.
a <- matrix( c(12, 30, 45, 32, 50, 16), byrow = TRUE, ncol = 3 )
prob <- c(0.5, 0.5)
x <- rmixdiri(100, a, prob)
Simulation of compositional data from the Flexible Dirichlet distribution.
rfd(n, alpha, prob, tau)
n |
The sample size. |
alpha |
A vector of the non-negative alpha parameters. |
prob |
A vector of the clusters' probabilities that must sum to one. |
tau |
The positive scalar tau parameter. |
For more information see the references and the package FlexDir.
A matrix with compositional data.
Michail Tsagris ported from the R package FlexDir. [email protected].
Ongaro A. and Migliorati S. (2013). A generalization of the Dirichlet distribution. Journal of Multivariate Analysis, 114, 412–426.
Migliorati S., Ongaro A. and Monti G. S. (2017). A structured Dirichlet mixture model for compositional data: inferential and applicative issues. Statistics and Computing, 27, 963–983.
alpha <- c(12, 11, 10)
prob <- c(0.25, 0.25, 0.5)
x <- rfd(100, alpha, prob, 7)
Simulation of compositional data from the folded model normal distribution.
rfolded(n, mu, su, a)
n |
The sample size. |
mu |
The mean vector. |
su |
The covariance matrix. |
a |
The value of the power transformation alpha. |
A sample from the folded model is generated.
A matrix with compositional data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. and Stewart C. (2020). A folded model for compositional data analysis. Australian and New Zealand Journal of Statistics, 62(2): 249-277. https://arxiv.org/pdf/1802.07330.pdf
s <- c(0.1490676523, -0.4580818209, 0.0020395316, -0.0047446076,
       -0.4580818209, 1.5227259250, 0.0002596411, 0.0074836251,
       0.0020395316, 0.0002596411, 0.0365384838, -0.0471448849,
       -0.0047446076, 0.0074836251, -0.0471448849, 0.0611442781)
s <- matrix(s, ncol = 4)
m <- c(1.715, 0.914, 0.115, 0.167)
x <- rfolded(100, m, s, 0.5)
a.est(x)
Spatial median regression with Euclidean data.
spatmed.reg(y, x, xnew = NULL, tol = 1e-07, ses = FALSE)
y |
A matrix with the compositional data. Zero values are not allowed. |
x |
The predictor variable(s), they have to be continuous. |
xnew |
If you have new data use it, otherwise leave it NULL. |
tol |
The threshold upon which to stop the iterations of the Newton-Raphson algorithm. |
ses |
If you want to extract the standard errors of the parameters, set this to TRUE. Be careful though as this can slow down the algorithm dramatically. In a run example with 10,000 observations and 10 variables for y and 30 for x, when ses = FALSE the algorithm can take 0.20 seconds, but when ses = TRUE it can go up to 140 seconds. |
The objective function is the minimization of the sum of the absolute residuals. It is the multivariate generalization of the median regression. This function is used by comp.reg.
A list including:
iter |
The number of iterations that were required. |
runtime |
The time required by the regression. |
be |
The beta coefficients. |
seb |
The standard error of the beta coefficients is returned if ses=TRUE and NULL otherwise. |
est |
The fitted values for xnew if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Biman Chakraborty (2003). On multivariate quantile regression. Journal of Statistical Planning and Inference, 110(1-2), 109-132. http://www.stat.nus.edu.sg/export/sites/dsap/research/documents/tr01_2000.pdf
multivreg, comp.reg, alfa.reg, js.compreg, diri.reg
library(MASS)
x <- as.matrix(iris[, 3:4])
y <- as.matrix(iris[, 1:2])
mod1 <- spatmed.reg(y, x)
mod2 <- multivreg(y, x, plot = FALSE)
Ternary diagram.
ternary(x, dg = FALSE, hg = FALSE, means = TRUE, pca = FALSE, colour = NULL)
x |
A matrix with the compositional data. |
dg |
Do you want diagonal grid lines to appear? If yes, set this TRUE. |
hg |
Do you want horizontal grid lines to appear? If yes, set this TRUE. |
means |
A boolean variable. Should the closed geometric mean and the arithmetic mean appear (TRUE) or not (FALSE)? |
pca |
Should the first principal component, calculated as Aitchison (1983) described, appear? If yes, set this to TRUE, otherwise FALSE. |
colour |
If you want the points to appear in different colours, supply a vector with the colour numbers or colour names. |
There are two ways to create a ternary graph; we use the one where each edge has length 1, which is what Aitchison (1986) uses. For every given point, the sum of its distances from the edges is constant. Horizontal and/or diagonal grid lines can appear, as can the closed geometric and the simple arithmetic mean. The first principal component is calculated using the centred log-ratio transformation, as Aitchison (1983, 1986) suggested. If the data contain zero values, the first principal component will not be plotted. Zeros in the data appear as green circles in the triangle and you will also see NaN in the closed geometric mean.
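The placement of a composition inside the unit-edge triangle is a barycentric mapping: the plotted point is the weighted average of the three vertices, with the closed composition as weights. A minimal NumPy sketch (Python, not the package's R code):

```python
# Barycentric coordinates for a ternary diagram with unit edges.
import numpy as np

verts = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

def ternary_xy(comp):
    comp = np.asarray(comp, dtype=float)
    comp = comp / comp.sum(axis=-1, keepdims=True)   # close the composition
    return comp @ verts                              # weighted vertex average

# Pure parts land on the vertices; equal parts land on the centroid.
pts = ternary_xy(np.array([[1, 0, 0], [0, 0, 1], [1, 1, 1]]))
```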
The ternary plot and a 2-row matrix with the means. The closed geometric and the simple arithmetic mean vector and or the first principal component will appear as well if the user has asked for them. Additionally, horizontal or diagonal grid lines can appear as well.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Aitchison, J. (1983). Principal component analysis of compositional data. Biometrika 70(1):57-65.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
ternary.mcr, ternary.reg, diri.contour
x <- as.matrix(iris[, 1:3])
x <- x / rowSums(x)
ternary(x, means = TRUE, pca = TRUE)
Ternary diagram of regression models.
ternary.reg(y, est, id, labs)
y |
A matrix with the compositional data. |
est |
A matrix with all fitted compositional data for all regression models, one under the other. |
id |
A vector indicating the regression model of each fitted compositional data set. |
labs |
The names of the regression models to appear in the legend. |
The points first appear on the ternary plot. Then, the fitted compositional data appear with different lines for each regression model.
The ternary plot and lines for the fitted values of each regression model.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
ternary, ternary.mcr, diri.contour
x <- cbind(1, rnorm(50) )
a <- exp( x %*% matrix( rnorm(6, 0, 0.4), ncol = 3) )
y <- matrix(NA, 50, 3)
for (i in 1:50)  y[i, ] <- rdiri(1, a[i, ])
est <- comp.reg(y, x[, -1], xnew = x[, -1])$est
ternary.reg(y, est, id = rep(1, 50), labs = "ALR regression")
Ternary diagram with confidence region for the matrix of coefficients of the SCLS or the TFLR model.
ternary.coefcr(y, x, type = "scls", conf = 0.95, R = 1000, dg = FALSE, hg = FALSE)
y |
A matrix with the response compositional data. |
x |
A matrix with the predictor compositional data. |
type |
The type of model to use, "scls" or "tflr". Depending on the model selected, the function will construct the confidence regions of the estimated matrix of coefficients of that model. |
conf |
The confidence level, by default this is set to 0.95. |
R |
Number of bootstrap replicates to run. |
dg |
Do you want diagonal grid lines to appear? If yes, set this TRUE. |
hg |
Do you want horizontal grid lines to appear? If yes, set this TRUE. |
This function runs the SCLS or the TFLR model and constructs confidence regions for the estimated matrix of regression coefficients using non-parametric bootstrap.
A ternary plot of the estimated matrix of coefficients of the SCLS or of the TFLR model, and their associated confidence regions.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Fiksel J., Zeger S. and Datta A. (2022). A transformation-free linear regression for compositional outcomes and predictors. Biometrics, 78(3): 974–987.
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
ternary, scls, tflr, ternary.mcr
y <- rdiri(50, runif(3))
x <- rdiri(50, runif(4))
ternary.coefcr(y, x, R = 500, dg = TRUE, hg = TRUE)
Ternary diagram with confidence region for the mean.
ternary.mcr(x, type = "alr", conf = 0.95, dg = FALSE, hg = FALSE, colour = NULL)
x |
A matrix with the compositional data. |
dg |
Do you want diagonal grid lines to appear? If yes, set this TRUE. |
type |
The type of log-ratio transformation to apply, either "alr" or "ilr". |
conf |
The confidence level, by default this is set to 0.95. |
hg |
Do you want horizontal grid lines to appear? If yes, set this TRUE. |
colour |
If you want the points to appear in different colours, supply a vector with the colour numbers or colour names. |
Ternary plot of compositional data including the log-ratio mean and its confidence region.
The confidence region is based on the Hotelling test statistic of the log-ratio
transformed data.
A ternary plot of compositional data including the log-ratio mean and its confidence region.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison, J. (1983). Principal component analysis of compositional data. Biometrika 70(1):57-65.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
ternary, ternary.reg, diri.contour
x <- as.matrix(iris[, 1:3])
x <- x / rowSums(x)
ternary.mcr(x, type = "alr", dg = TRUE, hg = TRUE)
Ternary diagram with the coefficients of the simplicial-simplicial regression models.
ternary.coef(B, dg = FALSE, hg = FALSE, colour = NULL)
B |
A matrix with the coefficients of the SCLS or the TFLR model. |
dg |
Do you want diagonal grid lines to appear? If yes, set this TRUE. |
hg |
Do you want horizontal grid lines to appear? If yes, set this TRUE. |
colour |
If you want the points to appear in different colours, supply a vector with the colour numbers or colour names. |
Ternary plot of the coefficients of the tflr or the scls functions.
A ternary plot of the coefficients of the tflr or the scls functions.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison, J. (1983). Principal component analysis of compositional data. Biometrika 70(1):57-65.
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
y <- as.matrix(iris[, 1:3])
y <- y / rowSums(y)
x <- rdiri(150, runif(5, 1, 4))
mod <- scls(y, x)
ternary.coef(mod$be)
The additive log-ratio transformation and its inverse.
alr(x)
alrinv(y)
x |
A numerical matrix with the compositional data. |
y |
A numerical matrix with data to be closed into the simplex. |
The additive log-ratio transformation with the first component being the common divisor is applied. The inverse of this transformation is also available. This means that no zeros are allowed.
A matrix with the alr transformed data (if alr is used) or with the compositional data (if the alrinv is used).
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
bc, pivot, fp, green, alfa, alfainv
library(MASS)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
y <- alr(x)
x1 <- alrinv(y)
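The transformation pair is easy to verify by hand. The following is a minimal sketch, where alr_manual and alrinv_manual are illustrative names (not package functions), assuming a zero-free composition:

```r
# Hand-rolled alr with the first component as the common divisor and its
# inverse. Illustrative only; alr_manual and alrinv_manual are not
# package functions.
alr_manual <- function(x)  log( x[, -1] / x[, 1] )
alrinv_manual <- function(y) {
  x <- cbind( 1, exp(y) )
  x / rowSums(x)   # close back into the simplex
}
x <- matrix( c(0.2, 0.3, 0.5,
               0.1, 0.4, 0.5), ncol = 3, byrow = TRUE )
x_back <- alrinv_manual( alr_manual(x) )
all.equal(x, x_back)   # TRUE: the round trip recovers the composition
```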
α-distance
This is the Euclidean (or Manhattan) distance after the α-transformation has been applied.
alfadist(x, a, type = "euclidean", square = FALSE)
alfadista(xnew, x, a, type = "euclidean", square = FALSE)
xnew |
A matrix or a vector with new compositional data. |
x |
A matrix with the compositional data. |
a |
The value of the power transformation, it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0, the isometric log-ratio transformation is applied. |
type |
Which type of distance do you want to calculate after the α-transformation, "euclidean" or "manhattan"? |
square |
In the case of the Euclidean distance, you can choose to return the squared distance by setting this TRUE. |
The α-transformation is applied to the compositional data first and then the Euclidean or the Manhattan distance is calculated.
For "alfadist", a matrix including the pairwise distances of all observations. For "alfadista", a matrix including the distances between xnew and x.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M.T., Preston S. and Wood A.T.A. (2016). Improved classification for compositional data using the α-transformation. Journal of Classification, 33(2): 243–261. https://arxiv.org/pdf/1506.04976v2.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
library(MASS)
x <- as.matrix(fgl[1:20, 2:9])
x <- x / rowSums(x)
alfadist(x, 0.1)
alfadist(x, 1)
α-IT transformation
The α-IT transformation.
ait(x, a, h = TRUE)
x |
A matrix with the compositional data. |
a |
The value of the power transformation, it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0, the isometric log-ratio transformation is applied. |
h |
A boolean variable. If TRUE (the default value) the multiplication with the Helmert sub-matrix will take place. When α = 0 and h = FALSE, the result is the centred log-ratio transformation (Aitchison, 1986). |
The α-IT transformation is applied to the compositional data.
A matrix with the α-IT transformed data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Clarotto L., Allard D. and Menafoglio A. (2022). A new class of α-transformations for the spatial analysis of Compositional Data. Spatial Statistics, 47.
aitdist, ait.knn, alfa, green, alr
library(MASS)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
y1 <- ait(x, 0.2)
y2 <- ait(x, 1)
rbind( colMeans(y1), colMeans(y2) )
α-IT-distance
This is the Euclidean (or Manhattan) distance after the α-IT-transformation has been applied.
aitdist(x, a, type = "euclidean", square = FALSE)
aitdista(xnew, x, a, type = "euclidean", square = FALSE)
xnew |
A matrix or a vector with new compositional data. |
x |
A matrix with the compositional data. |
a |
The value of the power transformation, it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0, the isometric log-ratio transformation is applied. |
type |
Which type of distance do you want to calculate after the α-IT-transformation, "euclidean" or "manhattan"? |
square |
In the case of the Euclidean distance, you can choose to return the squared distance by setting this TRUE. |
The α-IT-transformation is applied to the compositional data first and then the Euclidean or the Manhattan distance is calculated.
For "aitdist", a matrix including the pairwise distances of all observations. For "aitdista", a matrix including the distances between xnew and x.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Clarotto L., Allard D. and Menafoglio A. (2021). A new class of α-transformations for the spatial analysis of Compositional Data. https://arxiv.org/abs/2110.07967
library(MASS)
x <- as.matrix(fgl[1:20, 2:9])
x <- x / rowSums(x)
aitdist(x, 0.1)
aitdist(x, 1)
α-k-NN regression for compositional response data
The α-k-NN regression for compositional response data.
aknn.reg(xnew, y, x, a = seq(0.1, 1, by = 0.1), k = 2:10, apostasi = "euclidean", rann = FALSE)
xnew |
A matrix with the new predictor variables whose compositions are to be predicted. |
y |
A matrix with the compositional response data. Zeros are allowed. |
x |
A matrix with the available predictor variables. |
a |
The value(s) of the power parameter α. It can be a single value or a vector of values. |
k |
The number of nearest neighbours to consider. It can be a single number or a vector. |
apostasi |
The type of distance to use, either "euclidean" or "manhattan". |
rann |
If you have large scale datasets and want a faster k-NN search, you can use kd-trees implemented in the R package "Rnanoflann". In this case you must set this argument equal to TRUE. Note however, that in this case, the only available distance is by default "euclidean". |
The α-k-NN regression for compositional response variables is applied.
A list with the estimated compositional response data for each value of α and k.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M., Alenazi A. and Stewart C. (2023). Flexible non-parametric regression models for compositional response data with zeros. Statistics and Computing, 33(106).
https://link.springer.com/article/10.1007/s11222-023-10277-5
aknnreg.tune, akern.reg, alfa.reg, comp.ppr, comp.reg, kl.compreg
y <- as.matrix( iris[, 1:3] )
y <- y / rowSums(y)
x <- iris[, 4]
mod <- aknn.reg(x, y, x, a = c(0.4, 0.5), k = 2:3, apostasi = "euclidean")
α-k-NN regression with compositional predictor variables
The α-k-NN regression with compositional predictor variables.
alfa.knn.reg(xnew, y, x, a = 1, k = 2:10, apostasi = "euclidean", method = "average")
xnew |
A matrix with the new compositional predictor variables whose response is to be predicted. Zeros are allowed. |
y |
The response variable, a numerical vector. |
x |
A matrix with the available compositional predictor variables. Zeros are allowed. |
a |
A single value of the power parameter α. |
k |
The number of nearest neighbours to consider. It can be a single number or a vector. |
apostasi |
The type of distance to use, either "euclidean" or "manhattan". |
method |
If you want to take the average of the responses of the k closest observations, type "average". For the median, type "median" and for the harmonic mean, type "harmonic". |
The α-k-NN regression with compositional predictor variables is applied.
A matrix with the estimated response data for each value of k.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M., Alenazi A. and Stewart C. (2023). Flexible non-parametric regression models for compositional response data with zeros. Statistics and Computing, 33(106).
https://link.springer.com/article/10.1007/s11222-023-10277-5
aknn.reg, alfa.knn, alfa.pcr, alfa.ridge
library(MASS)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
y <- fgl[, 1]
mod <- alfa.knn.reg(x, y, x, a = 0.5, k = 2:4)
α-kernel regression with compositional response data
The α-kernel regression with compositional response data.
akern.reg( xnew, y, x, a = seq(0.1, 1, by = 0.1), h = seq(0.1, 1, length = 10), type = "gauss" )
xnew |
A matrix with the new predictor variables whose compositions are to be predicted. |
y |
A matrix with the compositional response data. Zeros are allowed. |
x |
A matrix with the available predictor variables. |
a |
The value(s) of the power parameter α. It can be a single value or a vector of values. |
h |
The bandwidth value(s) to consider. |
type |
The type of kernel to use, "gauss" or "laplace". |
The α-kernel regression for compositional response variables is applied.
A list with the estimated compositional response data for each value of α and h.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M., Alenazi A. and Stewart C. (2023). Flexible non-parametric regression models for compositional response data with zeros. Statistics and Computing, 33(106).
https://link.springer.com/article/10.1007/s11222-023-10277-5
akernreg.tune, aknn.reg, aknnreg.tune,
alfa.reg, comp.ppr, comp.reg, kl.compreg
y <- as.matrix( iris[, 1:3] )
y <- y / rowSums(y)
x <- iris[, 4]
mod <- akern.reg( x, y, x, a = c(0.4, 0.5), h = c(0.1, 0.2) )
α-SCLS model for compositional responses and predictors
The α-SCLS model for compositional responses and predictors.
ascls(y, x, a = seq(0.1, 1, by = 0.1), xnew)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. |
x |
A matrix with the compositional predictors. Zero values are allowed. |
a |
A vector or a single number of values of the power parameter α. |
xnew |
The new data for which predictions will be made. |
This is an extension of the SCLS model that includes the α-transformation and is intended solely for prediction purposes.
A list with matrices containing the predicted simplicial response values, one matrix for each value of α.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
library(MASS)
set.seed(1234)
y <- rdiri(214, runif(4, 1, 3))
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
mod <- ascls(y, x, xnew = x)
mod
α-TFLR model for compositional responses and predictors
The α-TFLR model for compositional responses and predictors.
atflr(y, x, a = seq(0.1, 1, by = 0.1), xnew)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. |
x |
A matrix with the compositional predictors. Zero values are allowed. |
a |
A vector or a single number of values of the power parameter α. |
xnew |
The new data for which predictions will be made. |
This is an extension of the TFLR model that includes the α-transformation and is intended solely for prediction purposes.
A list with matrices containing the predicted simplicial response values, one matrix for each value of α.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Fiksel J., Zeger S. and Datta A. (2022). A transformation-free linear regression for compositional outcomes and predictors. Biometrics, 78(3): 974–987.
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
library(MASS)
set.seed(1234)
y <- rdiri(214, runif(4, 1, 3))
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
mod <- atflr(y, x, a = c(0.5, 1), xnew = x)
mod
α-transformation
The α-transformation.
alfa(x, a, h = TRUE)
alef(x, a)
x |
A matrix with the compositional data. |
a |
The value of the power transformation, it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0, the isometric log-ratio transformation is applied. |
h |
A boolean variable. If TRUE (the default value) the multiplication with the Helmert sub-matrix will take place. When α = 0 and h = FALSE, the result is the centred log-ratio transformation (Aitchison, 1986). |
The α-transformation is applied to the compositional data. The command "alef" is the same as "alfa(x, a, h = FALSE)", but returns a different element as well and is necessary for the functions a.est, a.mle and alpha.mle.
A list including:
sa |
The logarithm of the Jacobian determinant of the α-transformation. |
sk |
If "alef" was called, this contains the sum of the power-transformed data, which is needed by the maximum likelihood estimation functions. |
aff |
The α-transformed data. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Tsagris M. and Stewart C. (2022). A Review of Flexible Transformations for Modeling Compositional Data. In Advances and Innovations in Statistics and Data Science, pp. 225–234. https://link.springer.com/chapter/10.1007/978-3-031-08329-7_10
Tsagris Michail and Stewart Connie (2020). A folded model for compositional data analysis. Australian and New Zealand Journal of Statistics, 62(2): 249-277. https://arxiv.org/pdf/1802.07330.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
alfainv, pivot, alfa.profile, alfa.tune
a.est, alpha.mle, alr, bc, fp, green
library(MASS)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
y1 <- alfa(x, 0.2)$aff
y2 <- alfa(x, 1)$aff
rbind( colMeans(y1), colMeans(y2) )
y3 <- alfa(x, 0.2, h = FALSE)$aff
dim(y1)  ;  dim(y3)
rowSums(y1)
rowSums(y3)
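For intuition, the h = FALSE case can be sketched directly from the definition in Tsagris et al. (2011); alfa_manual is an illustrative name, not the package function:

```r
# Sketch of the alpha-transformation without the Helmert multiplication
# (the h = FALSE case), following Tsagris et al. (2011). alfa_manual is
# an illustrative name, not the package function.
alfa_manual <- function(x, a) {
  D <- ncol(x)
  u <- x^a / rowSums(x^a)   # the power-transformed composition
  (D * u - 1) / a           # centre and scale
}
x <- matrix( c(0.2, 0.3, 0.5,
               0.1, 0.4, 0.5), ncol = 3, byrow = TRUE )
z <- alfa_manual(x, 0.5)
rowSums(z)   # each row sums to 0 by construction
```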
The Box-Cox transformation applied to ratios of components.
bc(x, lambda)
x |
A matrix with the compositional data. The first component must be zero values free. |
lambda |
The value of the power transformation, it has to be between -1 and 1. If zero values are present it has to be greater than 0. If λ = 0, the logarithm of the ratios is computed. |
The Box-Cox transformation applied to ratios of components, as described in Aitchison (1986) is applied.
A matrix with the transformed data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
library(MASS)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
y1 <- bc(x, 0.2)
y2 <- bc(x, 0)
rbind( colMeans(y1), colMeans(y2) )
rowSums(y1)
rowSums(y2)
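A minimal sketch of this transformation, assuming the first component is the common divisor of the ratios; bc_manual is an illustrative name, not the package function:

```r
# Sketch of the Box-Cox transformation of the component ratios x_j / x_1,
# assuming the first component is the common divisor. bc_manual is an
# illustrative name, not the package function.
bc_manual <- function(x, lambda) {
  r <- x[, -1] / x[, 1]
  if (lambda == 0)  log(r)  else  (r^lambda - 1) / lambda
}
x <- matrix( c(0.2, 0.3, 0.5,
               0.1, 0.4, 0.5), ncol = 3, byrow = TRUE )
y0 <- bc_manual(x, 0)     # lambda = 0 gives the log-ratios (the alr)
y2 <- bc_manual(x, 0.2)
```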
The ESOV-distance.
esov(x)
esova(xnew, x)
es(x1, x2)
x |
A matrix with compositional data. |
xnew |
A matrix or a vector with new compositional data. |
x1 |
A vector with compositional data. |
x2 |
A vector with compositional data. |
The ESOV distance is calculated.
For "esov()", a matrix including the pairwise distances of all observations.
For "esova()", a matrix including the distances between xnew and x.
For "es()" a number, the ESOV distance between x1 and x2.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris, Michail (2014). The k-NN algorithm for compositional data: a revised approach with and without zero values present. Journal of Data Science, 12(3): 519-534.
Endres, D. M. and Schindelin, J. E. (2003). A new metric for probability distributions. Information Theory, IEEE Transactions on 49, 1858-1860.
Osterreicher, F. and Vajda, I. (2003). A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics 55, 639-653.
alfadist, comp.knn, js.compreg
library(MASS)
x <- as.matrix(fgl[1:20, 2:9])
x <- x / rowSums(x)
esov(x)
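The pairwise building block can be sketched via the Endres-Schindelin metric cited above, i.e. the square root of (twice) the Jensen-Shannon divergence; es_manual is an illustrative name and the package's es() may differ in normalisation:

```r
# Sketch of the Endres-Schindelin metric between two compositions: the
# square root of (twice) the Jensen-Shannon divergence. Illustrative
# only; the package's es() may differ in normalisation.
es_manual <- function(x1, x2) {
  m <- (x1 + x2) / 2
  sqrt( sum( x1 * log(x1 / m) + x2 * log(x2 / m) ) )
}
x1 <- c(0.2, 0.3, 0.5)
x2 <- c(0.1, 0.4, 0.5)
es_manual(x1, x2)   # positive and symmetric
es_manual(x1, x1)   # 0 for identical compositions
```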
The folded power transformation.
fp(x, lambda)
x |
A matrix with the compositional data. Zero values are allowed. |
lambda |
The value of the power transformation, it has to be between -1 and 1. If zero values are present it has to be greater than 0. If λ = 0, the folded logarithm is applied. |
The folded power transformation is applied to the compositional data.
A matrix with the transformed data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Atkinson A. C. (1985). Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Oxford University Press.
library(MASS)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
y1 <- fp(x, 0.2)
y2 <- fp(x, 0)
rbind( colMeans(y1), colMeans(y2) )
rowSums(y1)
rowSums(y2)
Mean vector or matrix with mean vectors of compositional data using the α-transformation.
frechet(x, a)
x |
A matrix with the compositional data. |
a |
The value of the power transformation, it has to be between -1 and 1. If zero values are present it has to be greater than 0. If a vector of values is supplied, the Frechet mean is computed for each value. |
The power transformation is applied to the compositional data and the mean vector is calculated. The inverse of the power transformation is then applied to this mean vector, yielding the Frechet mean.
If α is a single value, the function will return a vector with the Frechet mean for the given value of α. Otherwise the function will return a matrix with the Frechet means for each value of α.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
library(MASS)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
frechet(x, 0.2)
frechet(x, 1)
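The steps above can be sketched directly; the linear centring and scaling of the α-transformation cancel under averaging, so only the power step matters. frechet_manual is an illustrative name, not the package function:

```r
# Step-by-step Frechet mean following the Details section. frechet_manual
# is an illustrative name, not the package function.
frechet_manual <- function(x, a) {
  u <- x^a / rowSums(x^a)   # power transformation
  m <- colMeans(u)          # mean in the transformed space
  m <- m^(1 / a)            # inverse power transformation
  m / sum(m)                # close back into the simplex
}
x <- matrix( c(0.2, 0.3, 0.5,
               0.1, 0.4, 0.5), ncol = 3, byrow = TRUE )
frechet_manual(x, 0.5)
frechet_manual(x, 1)   # a = 1 recovers the closed arithmetic mean
```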
The Helmert sub-matrix.
helm(n)
n |
A number greater than or equal to 2. |
The Helmert sub-matrix is returned. It is the orthonormal Helmert matrix with its first row removed.
A matrix.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
John Aitchison (2003). The Statistical Analysis of Compositional Data, p. 99. Blackburn Press.
Lancaster H. O. (1965). The Helmert matrices. The American Mathematical Monthly 72(1): 4-12.
helm(3)
helm(5)
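The matrix can also be built directly: row i has i entries equal to 1/sqrt(i(i+1)) followed by -i/sqrt(i(i+1)). helm_manual is an illustrative name, not the package function:

```r
# Direct construction of the Helmert sub-matrix: the orthonormal n x n
# Helmert matrix with its constant first row removed. helm_manual is an
# illustrative name, not the package function.
helm_manual <- function(n) {
  H <- matrix(0, n - 1, n)
  for ( i in 1:(n - 1) ) {
    H[i, 1:i] <- 1 / sqrt( i * (i + 1) )
    H[i, i + 1] <- -i / sqrt( i * (i + 1) )
  }
  H
}
H <- helm_manual(4)
H %*% t(H)    # the identity matrix: the rows are orthonormal
rowSums(H)    # zeros: every row is orthogonal to the constant vector
```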
α-distance
The k-nearest neighbours using the α-distance.
alfann(xnew, x, a, k = 10, rann = FALSE)
xnew |
A matrix or a vector with new compositional data. |
x |
A matrix with the compositional data. |
a |
The value of the power transformation, it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0, the isometric log-ratio transformation is applied. |
k |
The number of nearest neighbours to search for. |
rann |
If you have large scale datasets and want a faster k-NN search, you can use kd-trees implemented in the R package "Rnanoflann". In this case you must set this argument equal to TRUE. Note however, that in this case, the only available distance is by default "euclidean". |
The α-transformation is applied to the compositional data first and the indices of the k-nearest neighbours using the Euclidean distance are returned.
A matrix including the indices of the nearest neighbours of each xnew from x.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M., Alenazi A. and Stewart C. (2023). Flexible non-parametric regression models for compositional response data with zeros. Statistics and Computing, 33(106).
https://link.springer.com/article/10.1007/s11222-023-10277-5
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain.
https://arxiv.org/pdf/1106.1451.pdf
alfa.knn, comp.nb, alfa.rda, alfa.nb,
aknn.reg, alfa, alfainv
library(MASS)
xnew <- as.matrix(fgl[1:20, 2:9])
xnew <- xnew / rowSums(xnew)
x <- as.matrix(fgl[-c(1:20), 2:9])
x <- x / rowSums(x)
b <- alfann(xnew, x, a = 0.1, k = 10)
The k-NN algorithm for compositional data with and without using the power transformation.
comp.knn(xnew, x, ina, a = 1, k = 5, apostasi = "ESOV", mesos = TRUE)
alfa.knn(xnew, x, ina, a = 1, k = 5, mesos = TRUE, apostasi = "euclidean", rann = FALSE)
ait.knn(xnew, x, ina, a = 1, k = 5, mesos = TRUE, apostasi = "euclidean", rann = FALSE)
xnew |
A matrix with the new compositional data whose group is to be predicted. Zeros are allowed, but you must be careful to choose strictly positive values of α. |
x |
A matrix with the available compositional data. Zeros are allowed, but you must be careful to choose strictly positive values of α. |
ina |
A group indicator variable for the available data. |
a |
The value of the power parameter α. |
k |
The number of nearest neighbours to consider. It can be a single number or a vector. |
apostasi |
The type of distance to use. For comp.knn this can be one of the following: "ESOV", "taxicab", "Ait", "Hellinger", "angular" or "CS". See the references for them. For alfa.knn this can be either "euclidean" or "manhattan". |
mesos |
This is used in the non-standard algorithm. If TRUE, the arithmetic mean of the distances is calculated, otherwise the harmonic mean is used (see details). |
rann |
If you have large scale datasets and want a faster k-NN search, you can use kd-trees implemented in the R package "Rnanoflann". In this case you must set this argument equal to TRUE. Note however, that in this case, the only available distance is by default "euclidean". |
The k-NN algorithm is applied to the compositional data. There are many metrics and possibilities to choose from. The standard algorithm finds the k nearest observations to a new observation and allocates it to the class that appears most often among those neighbours. The non-standard algorithm computes the arithmetic or the harmonic mean of the distances to each class and allocates the new observation to the class with the minimum mean distance.
A vector with the estimated groups.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Tsagris, Michail (2014). The k-NN algorithm for compositional data: a revised approach with and without zero values present. Journal of Data Science, 12(3): 519–534.
Friedman Jerome, Trevor Hastie and Robert Tibshirani (2009). The elements of statistical learning, 2nd edition. Springer, Berlin
Tsagris Michail, Simon Preston and Andrew T.A. Wood (2016). Improved classification for compositional data using the α-transformation. Journal of Classification, 33(2): 243–261.
Connie Stewart (2017). An approach to measure distance between compositional diet estimates containing essential zeros. Journal of Applied Statistics 44(7): 1137–1152.
Clarotto L., Allard D. and Menafoglio A. (2022). A new class of α-transformations for the spatial analysis of Compositional Data. Spatial Statistics, 47.
Endres, D. M. and Schindelin, J. E. (2003). A new metric for probability distributions. Information Theory, IEEE Transactions on 49, 1858–1860.
Osterreicher, F. and Vajda, I. (2003). A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics 55, 639–653.
compknn.tune, alfa.rda, comp.nb, alfa.nb, alfa,
esov, mix.compnorm
x <- as.matrix( iris[, 1:4] )
x <- x / rowSums(x)
ina <- iris[, 5]
mod <- comp.knn(x, x, ina, a = 1, k = 5)
table(ina, mod)
mod2 <- alfa.knn(x, x, ina, a = 1, k = 5)
table(ina, mod2)
The multiplicative log-ratio transformation and its inverse.
mlr(x)
mlrinv(y)
x |
A numerical matrix with the compositional data. |
y |
A numerical matrix with data to be closed into the simplex. |
The multiplicative log-ratio transformation and its inverse are applied here. This means that no zeros are allowed.
A matrix with the mlr transformed data (if mlr is used) or with the compositional data (if the mlrinv is used).
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
library(MASS)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
y <- mlr(x)
x1 <- mlrinv(y)
The pivot coordinate transformation and its inverse.
pivot(x)
pivotinv(y)
x |
A numerical matrix with the compositional data. |
y |
A numerical matrix with data to be closed into the simplex. |
The pivot coordinate transformation and its inverse are computed. This means that no zeros are allowed.
A matrix with the pivot coordinates (if pivot is used) or with the compositional data (if pivotinv is used).
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Peter Filzmoser, Karel Hron and Matthias Templ (2018). Applied Compositional Data Analysis With Worked Examples in R (pages 49 and 51). Springer.
library(MASS)
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
y <- pivot(x)
x1 <- pivotinv(y)
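A sketch of the forward transformation, following the formula in Filzmoser, Hron and Templ (2018): coordinate j contrasts component j against the geometric mean of the remaining components. pivot_manual is an illustrative name and the package implementation may differ in details:

```r
# Sketch of the pivot coordinates as defined in Filzmoser, Hron and
# Templ (2018). pivot_manual is an illustrative name, not the package
# function.
pivot_manual <- function(x) {
  D <- ncol(x)
  z <- matrix(0, nrow(x), D - 1)
  for ( j in 1:(D - 1) ) {
    # geometric mean of the components after the j-th one
    gm <- exp( rowMeans( log( x[, (j + 1):D, drop = FALSE] ) ) )
    z[, j] <- sqrt( (D - j) / (D - j + 1) ) * log( x[, j] / gm )
  }
  z
}
x <- matrix( c(0.2, 0.3, 0.5,
               0.1, 0.4, 0.5), ncol = 3, byrow = TRUE )
pivot_manual(x)
```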
Simplicial constrained linear least squares (SCLS) for compositional responses and predictors.
scls(y, x, xnew = NULL, nbcores = 4)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. It may also be a big matrix of the FBM class. |
x |
A matrix with the compositional predictors. Zero values are allowed. It may also be a big matrix of the FBM class. |
xnew |
If you have new data use it, otherwise leave it NULL. |
nbcores |
The number of cores to use in the case of an FBM class (big) matrix. If you do not know how many cores to use, you may try the command nb_cores() from the bigparallelr package. |
The function performs least squares regression where the beta coefficients are constrained to be positive and to sum to 1. We were inspired by the transformation-free linear regression for compositional responses and predictors of Fiksel, Zeger and Datta (2022). Our implementation now uses quadratic programming instead of the function optim, and the solution is more accurate and extremely fast.
Big matrices, of FBM class, are now accepted.
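The quadratic programming formulation described above can be sketched as follows. This is a minimal illustration assuming the quadprog package (the helper name scls_qp is ours, not the package's internal code); it minimises the Frobenius norm of Y - XB subject to the stated constraints:

```r
# Sketch of SCLS as a quadratic programme (assumes the 'quadprog' package).
# Minimise ||Y - X B||_F^2 over B with B >= 0 and each row of B summing to 1,
# so that X B stays in the simplex. Not the package's internal code.
library(quadprog)
scls_qp <- function(y, x, ridge = 1e-10) {
  y <- as.matrix(y);  x <- as.matrix(x)
  p <- ncol(x);  d <- ncol(y)
  # vec(B) is column-major; the quadratic term is I_d (x) X'X
  Dmat <- kronecker(diag(d), crossprod(x)) + ridge * diag(p * d)
  dvec <- as.vector(crossprod(x, y))            # vec(X'Y)
  Aeq  <- kronecker(matrix(1, 1, d), diag(p))   # row sums of B
  Amat <- cbind(t(Aeq), diag(p * d))            # equalities first, then B >= 0
  bvec <- c(rep(1, p), rep(0, p * d))
  sol  <- solve.QP(Dmat, dvec, Amat, bvec, meq = p)
  matrix(sol$solution, p, d)                    # the beta coefficients
}
```

When y equals x, the identity matrix is feasible and attains zero error, so the solver returns a matrix close to the identity.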
A list including:
mse |
The mean squared error. |
be |
The beta coefficients. |
est |
The fitted values for xnew, if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
Fiksel J., Zeger S. and Datta A. (2022). A transformation-free linear regression for compositional outcomes and predictors. Biometrics, 78(3): 974–987.
cv.scls, tflr, scls.indeptest, scrq
library(MASS)
set.seed(1234)
y <- rdiri(214, runif(4, 1, 3))
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
mod <- scls(y, x)
mod
The SCLS model with multiple compositional predictors.
scls2(y, x, wei = FALSE, xnew = NULL)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. |
x |
A list of matrices with the compositional predictors. Zero values are allowed. |
wei |
Do you want weights among the different simplicial predictors? The default is FALSE. |
xnew |
If you have new data use it, otherwise leave it NULL. |
The function performs least squares regression where the beta coefficients are constrained to be positive and to sum to 1. We were inspired by the transformation-free linear regression for compositional responses and predictors of Fiksel, Zeger and Datta (2022). Our implementation now uses quadratic programming instead of the function optim, and the solution is more accurate and extremely fast. This function allows for more than one simplicial predictor and offers the possibility of assigning weights to each simplicial predictor.
A list including:
ini.mse |
The mean squared error when all simplicial predictors carry equal weight. |
ini.be |
The beta coefficients when all simplicial predictors carry equal weight. |
mse |
The mean squared error when the simplicial predictors carry unequal weights. |
weights |
The weights in a vector form. A vector of length equal to the number of rows of the matrix of coefficients. |
am |
The vector of weights, one for each simplicial predictor. The length of the vector is equal to the number of simplicial predictors. |
est |
The fitted values for xnew, if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
library(MASS)
set.seed(1234)
y <- rdiri(214, runif(4, 1, 3))
x1 <- as.matrix(fgl[, 2:9])
x <- list()
x[[ 1 ]] <- x1 / rowSums(x1)
x[[ 2 ]] <- Compositional::rdiri(214, runif(4))
mod <- scls2(y, x)
mod
The TFLR model with multiple compositional predictors
tflr2(y, x, wei = FALSE, xnew = NULL)
y |
A matrix with the compositional data (dependent variable). Zero values are allowed. |
x |
A list of matrices with the compositional predictors. Zero values are allowed. |
wei |
Do you want weights among the different simplicial predictors? The default is FALSE. |
xnew |
If you have new data use it, otherwise leave it NULL. |
The transformation-free linear regression for compositional responses and predictors is implemented.
The function to be minimized is the Kullback-Leibler divergence between the observed and the fitted compositional responses, sum_ij y_ij * log( y_ij / (XB)_ij ). This is a self-implementation of the function that can be found in the package codalm. This function allows for more than one simplicial predictor and offers the possibility of assigning weights to each simplicial predictor.
A list including:
ini.mse |
The mean squared error when all simplicial predictors carry equal weight. |
ini.be |
The beta coefficients when all simplicial predictors carry equal weight. |
mse |
The mean squared error when the simplicial predictors carry unequal weights. |
weights |
The weights in a vector form. A vector of length equal to the number of rows of the matrix of coefficients. |
am |
The vector of weights, one for each simplicial predictor. The length of the vector is equal to the number of simplicial predictors. |
est |
The fitted values for xnew, if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Fiksel J., Zeger S. and Datta A. (2022). A transformation-free linear regression for compositional outcomes and predictors. Biometrics, 78(3): 974–987.
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
library(MASS)
set.seed(1234)
y <- rdiri(214, runif(4, 1, 3))
x1 <- as.matrix(fgl[, 2:9])
x <- list()
x[[ 1 ]] <- x1 / rowSums(x1)
x[[ 2 ]] <- Compositional::rdiri(214, runif(4))
mod <- tflr2(y, x)
mod
Transformation-free linear regression (TFLR) for compositional responses and predictors.
tflr(y, x, xnew = NULL)
y |
A matrix with the compositional response. Zero values are allowed. |
x |
A matrix with the compositional predictors. Zero values are in general allowed, but there can be cases when these are problematic. |
xnew |
If you have new data use it, otherwise leave it NULL. |
The transformation-free linear regression for compositional responses and predictors is implemented.
The function to be minimized is the Kullback-Leibler divergence between the observed and the fitted compositional responses, sum_ij y_ij * log( y_ij / (XB)_ij ). This is an efficient self-implementation.
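The objective just described can be evaluated directly; the sketch below shows the objective only, not the fitting algorithm, and the helper name tflr_kl is ours:

```r
# Sketch of the TFLR objective only (not the fitting algorithm): the
# Kullback-Leibler divergence between the observed compositions y and the
# fitted compositions x %*% B, where B has non-negative entries and unit
# row sums. Terms with y = 0 contribute zero, which is why zero values
# in the response are allowed.
tflr_kl <- function(y, x, B) {
  fit <- x %*% B
  ok  <- y > 0
  sum(y[ok] * log(y[ok] / fit[ok]))
}
```

When y exactly equals x %*% B, the divergence is zero.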
A list including:
kl |
The Kullback-Leibler divergence between the observed and the fitted response compositional data. |
be |
The beta coefficients. |
est |
The fitted values of xnew if xnew is not NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Fiksel J., Zeger S. and Datta A. (2022). A transformation-free linear regression for compositional outcomes and predictors. Biometrics, 78(3): 974–987.
Tsagris M. (2024). Constrained least squares simplicial-simplicial regression. https://arxiv.org/pdf/2403.19835.pdf
library(MASS)
y <- rdiri(214, runif(3, 1, 3))
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
mod <- tflr(y, x, x)
mod
Total variability.
totvar(x, a = 0)
x |
A numerical matrix with the compositional data. |
a |
The value of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0, the centred log-ratio transformation is used. |
The α-transformation is applied and the sum of the variances of the transformed variables is calculated. This is the total variability. Aitchison (1986) used the centred log-ratio transformation, but we have extended it to cover more geometries, via the α-transformation.
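For α = 0 this reduces to Aitchison's (1986) definition: the sum of the variances of the centred log-ratio (clr) scores. A minimal sketch of that special case only (the helper name totvar0 is ours, and the package's totvar() additionally covers other geometries through the α-transformation):

```r
# Total variability at alpha = 0: the sum of the variances of the
# clr-transformed data. A sketch of the classical Aitchison (1986)
# definition only; zeros are not allowed here because of the logarithm.
totvar0 <- function(x) {
  z <- log(x) - rowMeans(log(x))   # centred log-ratio transformation
  sum(apply(z, 2, var))
}
```

A data set whose rows are all the same composition has zero total variability.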
The total variability of the data in a given geometry as dictated by the value of α.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
alfa, alfainv, alfa.profile, alfa.tune
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
totvar(x)
α-generalised correlations between two compositional datasets
Tuning of the α-generalised correlations between two compositional datasets.
acor.tune(y, x, a, type = "dcor")
y |
A matrix with the compositional data. |
x |
A matrix with the compositional data. |
a |
The range of values of the power transformation to search for the optimal one. If zero values are present it has to be greater than 0. |
type |
The type of correlation to compute: the distance correlation ("dcor"), the canonical correlation type 1 ("cancor1") or the canonical correlation type 2 ("cancor2"). See details for more information. |
The α-transformation is applied to each composition and then, if type = "dcor", the distance correlation is computed; otherwise a canonical correlation is computed. If type = "cancor1" the function returns the value of α that maximizes the product of the eigenvalues. If type = "cancor2" the function returns the value of α that maximizes the largest eigenvalue.
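The α-transformation of Tsagris, Preston and Wood (2011) used throughout these functions can be sketched as below. This illustration (the helper name alpha_trans is ours) omits the Helmert sub-matrix projection that the package's alfa() additionally applies:

```r
# Sketch of the alpha-transformation (Tsagris, Preston & Wood, 2011),
# without the Helmert projection the package's alfa() additionally applies.
# As alpha -> 0 it tends to the centred log-ratio transformation.
alpha_trans <- function(x, a) {
  D <- ncol(x)
  if (a == 0) return(log(x) - rowMeans(log(x)))  # clr limit
  u <- x^a / rowSums(x^a)                        # power-transformed composition
  (D * u - 1) / a
}
```

For any α the transformed rows sum to zero, which is why the Helmert projection to D - 1 coordinates loses no information.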
A list including:
alfa |
The optimal value of α. |
acor |
The maximum value of the acor. |
runtime |
The runtime of the optimization procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
acor, alfa.profile, alfa, alfainv
y <- rdiri(30, runif(3))
x <- rdiri(30, runif(4))
acor(y, x, a = 0.4)
Tuning of the bandwidth h of the kernel using the maximum likelihood cross validation.
mkde.tune( x, low = 0.1, up = 3, s = cov(x) )
x |
A matrix with Euclidean (continuous) data. |
low |
The minimum value to search for the optimal bandwidth value. |
up |
The maximum value to search for the optimal bandwidth value. |
s |
A covariance matrix. By default it is equal to the covariance matrix of the data, but can change to a robust covariance matrix, MCD for example. |
Maximum likelihood cross validation is applied in order to choose the optimal value of the bandwidth parameter. No plot is produced.
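The criterion being maximised can be sketched as follows: for a candidate bandwidth h, the leave-one-out pseudo-log-likelihood under a multivariate normal kernel with covariance h^2 * s. This is an illustration of the criterion only (the helper name mlcv is ours), not the package's code:

```r
# Sketch of maximum likelihood cross-validation for a multivariate kernel
# density estimate: the leave-one-out pseudo-log-likelihood at bandwidth h,
# with a normal kernel of covariance h^2 * s. mkde.tune() maximises this
# over h; here we only evaluate it for a given h.
mlcv <- function(h, x, s = cov(x)) {
  n  <- nrow(x);  d <- ncol(x)
  si <- solve(s)                        # inverse once, reused for all points
  ll <- 0
  for (i in 1:n) {
    m2 <- mahalanobis(x[-i, , drop = FALSE], x[i, ], si, inverted = TRUE)
    k  <- exp(-0.5 * m2 / h^2) / ((2 * pi)^(d / 2) * h^d * sqrt(det(s)))
    ll <- ll + log(mean(k))             # leave-one-out density at x_i
  }
  ll
}
```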
A list including:
hopt |
The optimal bandwidth value. |
maximum |
The value of the pseudo-log-likelihood at that given bandwidth value. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Arsalane Chouaib Guidoum (2015). Kernel Estimator and Bandwidth Selection for Density and its Derivatives. The kedd R package. http://cran.r-project.org/web/packages/kedd/vignettes/kedd.pdf
M.P. Wand and M.C. Jones (1995). Kernel smoothing, pages 91-92.
library(MASS)
mkde.tune(as.matrix(iris[, 1:4]), low = 0.1, up = 3)
α-transformation
Tuning of the divergence based regression for compositional data with compositional data in the covariates side using the α-transformation.
klalfapcr.tune(y, x, covar = NULL, nfolds = 10, maxk = 50, a = seq(-1, 1, by = 0.1), folds = NULL, graph = FALSE, tol = 1e-07, maxiters = 50, seed = NULL)
y |
A numerical matrix with compositional data with or without zeros. |
x |
A matrix with the predictor variables, the compositional data. Zero values are allowed. |
covar |
If you have other continuous covariates put them here. |
nfolds |
The number of folds for the K-fold cross validation, set to 10 by default. |
maxk |
The maximum number of principal components to check. |
a |
The value of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0, the isometric log-ratio transformation is applied. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
graph |
If graph is TRUE a plot will appear. |
tol |
The tolerance value to terminate the Newton-Raphson procedure. |
maxiters |
The maximum number of Newton-Raphson iterations. |
seed |
You can specify your own seed number here or leave it NULL. |
The M-fold cross validation is performed in order to select the optimal values for α and k, the number of principal components. The α-transformation is applied to the compositional data first, the first k principal component scores are calculated and used as predictor variables for the Kullback-Leibler divergence based regression model. This procedure is performed M times during the M-fold cross validation.
A list including:
mspe |
A list with the KL divergence for each value of α and number of principal components. |
performance |
A matrix with the KL divergence for each value of α and number of principal components. |
best.perf |
The minimum KL divergence. |
params |
The values of α and k that minimize the KL divergence. |
Initial code by Abdulaziz Alenazi. Modifications by Michail Tsagris.
R implementation and documentation: Abdulaziz Alenazi [email protected] and Michail Tsagris [email protected].
Alenazi A. (2019). Regression for compositional data with compositional data as predictor variables with or without zero values. Journal of Data Science, 17(1): 219–238. https://jds-online.org/journal/JDS/article/136/file/pdf
Tsagris M. (2015). Regression analysis with compositional data containing zero values. Chilean Journal of Statistics, 6(2): 47–57. http://arxiv.org/pdf/1508.01913v1.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. http://arxiv.org/pdf/1106.1451.pdf
kl.alfapcr, cv.tflr, glm.pcr, alfapcr.tune
library(MASS)
y <- rdiri(214, runif(4, 1, 3))
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
mod <- klalfapcr.tune(y = y, x = x, a = c(0.7, 0.8))
mod
Tuning of the k-NN algorithm for compositional data with and without using the power or the α-transformation. In addition, estimation of the rate of correct classification via K-fold cross-validation.
compknn.tune(x, ina, nfolds = 10, k = 2:5, mesos = TRUE, a = seq(-1, 1, by = 0.1), apostasi = "ESOV", folds = NULL, stratified = TRUE, seed = NULL, graph = FALSE)
alfaknn.tune(x, ina, nfolds = 10, k = 2:5, mesos = TRUE, a = seq(-1, 1, by = 0.1), apostasi = "euclidean", rann = FALSE, folds = NULL, stratified = TRUE, seed = NULL, graph = FALSE)
aitknn.tune(x, ina, nfolds = 10, k = 2:5, mesos = TRUE, a = seq(-1, 1, by = 0.1), apostasi = "euclidean", rann = FALSE, folds = NULL, stratified = TRUE, seed = NULL, graph = FALSE)
x |
A matrix with the available compositional data. Zeros are allowed, but you must be careful to choose strictly positive values of α. |
ina |
A group indicator variable for the available data. |
nfolds |
The number of folds to be used. This is taken into consideration only if the folds argument is not supplied. |
k |
A vector with the nearest neighbours to consider. |
mesos |
This is used in the non-standard algorithm. If TRUE, the arithmetic mean of the distances is calculated, otherwise the harmonic mean is used (see details). |
a |
A grid of values of α to consider. |
apostasi |
The type of distance to use. For the compk.knn this can be one of the following: "ESOV", "taxicab", "Ait", "Hellinger", "angular" or "CS". See the references for them. For the alfa.knn this can be either "euclidean" or "manhattan". |
rann |
If you have large scale datasets and want a faster k-NN search, you can use kd-trees implemented in the R package "Rnanoflann". In this case you must set this argument equal to TRUE. Note however, that in this case, the only available distance is by default "euclidean". |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
stratified |
Do you want the folds to be created in a stratified way? TRUE or FALSE. |
seed |
You can specify your own seed number here or leave it NULL. |
graph |
If set to TRUE a graph with the results will appear. |
The k-NN algorithm is applied to the compositional data. There are many metrics and possibilities to choose from. The algorithm finds the k nearest observations to a new observation and allocates it to the class that appears most often among the neighbours.
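For instance, with apostasi = "Ait" the distance between two compositions is the Euclidean distance between their centred log-ratio (clr) scores. A hedged sketch of the standard-algorithm vote for a single new observation (the helper names clr and ait_knn1 are ours, not the package's):

```r
# Sketch of the standard k-NN vote with the Aitchison distance ("Ait"):
# Euclidean distance between centred log-ratio scores, then a majority vote
# among the k nearest training compositions. Illustration only; comp.knn()
# offers several further distances and a mean-distance variant.
clr <- function(x) log(x) - rowMeans(log(x))
ait_knn1 <- function(xnew, x, ina, k = 5) {
  z  <- clr(x)
  z0 <- log(xnew) - mean(log(xnew))        # clr of the new composition
  d2 <- rowSums(sweep(z, 2, z0)^2)         # squared Aitchison distances
  nn <- order(d2)[1:k]
  names(which.max(table(ina[nn])))         # majority class among neighbours
}
```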
A list including:
per |
A matrix or a vector (depending on the distance chosen) with the rates of correct classification, averaged over all folds, for all hyper-parameters (α and k). |
performance |
The estimated rate of correct classification. |
best_a |
The best value of α. |
best_k |
The best number of nearest neighbours. |
runtime |
The run time of the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Tsagris, Michail (2014). The k-NN algorithm for compositional data: a revised approach with and without zero values present. Journal of Data Science, 12(3): 519–534. https://arxiv.org/pdf/1506.05216.pdf
Friedman Jerome, Trevor Hastie and Robert Tibshirani (2009). The elements of statistical learning, 2nd edition. Springer, Berlin
Tsagris M., Preston S. and Wood A.T.A. (2016). Improved classification for compositional data using the α-transformation. Journal of Classification, 33(2): 243–261. http://arxiv.org/pdf/1106.1451.pdf
Connie Stewart (2017). An approach to measure distance between compositional diet estimates containing essential zeros. Journal of Applied Statistics 44(7): 1137–1152.
Clarotto L., Allard D. and Menafoglio A. (2022). A new class of α-transformations for the spatial analysis of Compositional Data. Spatial Statistics, 47.
Endres, D. M. and Schindelin, J. E. (2003). A new metric for probability distributions. Information Theory, IEEE Transactions on 49, 1858–1860.
Osterreicher, F. and Vajda, I. (2003). A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics 55, 639–653.
comp.knn, alfarda.tune, cv.dda, cv.compnb
x <- as.matrix(iris[, 1:4])
x <- x / rowSums(x)
ina <- iris[, 5]
mod1 <- compknn.tune(x, ina, a = seq(1, 1, by = 0.1))
mod2 <- alfaknn.tune(x, ina, a = seq(-1, 1, by = 0.1))
Tuning of the projection pursuit regression for compositional data.
compppr.tune(y, x, nfolds = 10, folds = NULL, seed = NULL, nterms = 1:10, type = "alr", yb = NULL )
y |
A matrix with the available compositional data, but zeros are not allowed. |
x |
A matrix with the continuous predictor variables. |
nfolds |
The number of folds to use. |
folds |
If you have the list with the folds supply it here. |
seed |
You can specify your own seed number here or leave it NULL. |
nterms |
The number of terms to try in the projection pursuit regression. |
type |
Either "alr" or "ilr" corresponding to the additive or the isometric log-ratio transformation respectively. |
yb |
If you have already transformed the data using a log-ratio transformation put it here. Otherwise leave it NULL. |
The function performs tuning of the projection pursuit regression algorithm.
A list including:
kl |
The average Kullback-Leibler divergence. |
perf |
The average Kullback-Leibler divergence. |
runtime |
The run time of the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76, 817-823. doi: 10.2307/2287576.
comp.ppr, aknnreg.tune, akernreg.tune
y <- as.matrix(iris[, 1:3])
y <- y / rowSums(y)
x <- iris[, 4]
mod <- compppr.tune(y, x)
Tuning of the projection pursuit regression with compositional predictor variables.
pprcomp.tune(y, x, nfolds = 10, folds = NULL, seed = NULL, nterms = 1:10, type = "log", graph = FALSE)
y |
A numerical vector with the continuous variable. |
x |
A matrix with the available compositional data, but zeros are not allowed. |
nfolds |
The number of folds to use. |
folds |
If you have the list with the folds supply it here. |
seed |
You can specify your own seed number here or leave it NULL. |
nterms |
The number of terms to try in the projection pursuit regression. |
type |
Either "alr" or "log" corresponding to the additive log-ratio transformation or the logarithm applied to the compositional predictor variables. |
graph |
If graph is TRUE a filled contour plot will appear. |
The function performs tuning of the projection pursuit regression algorithm with compositional predictor variables.
A list including:
runtime |
The run time of the cross-validation procedure. |
mse |
The mean squared error of prediction for each number of terms. |
opt.nterms |
The number of terms corresponding to the minimum mean squared error of prediction. |
opt.alpha |
The value of |
performance |
The minimum mean squared error of prediction. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76, 817-823. doi: 10.2307/2287576.
pprcomp, ice.pprcomp, alfapcr.tune, compppr.tune
x <- as.matrix(iris[, 2:4])
x <- x / rowSums(x)
y <- iris[, 1]
mod <- pprcomp.tune(y, x)
α-transformation
Tuning of the projection pursuit regression with compositional predictor variables using the α-transformation.
alfapprcomp.tune(y, x, nfolds = 10, folds = NULL, seed = NULL, nterms = 1:10, a = seq(-1, 1, by = 0.1), graph = FALSE)
y |
A numerical vector with the continuous variable. |
x |
A matrix with the available compositional data. Zeros are allowed. |
nfolds |
The number of folds to use. |
folds |
If you have the list with the folds supply it here. |
seed |
You can specify your own seed number here or leave it NULL. |
nterms |
The number of terms to try in the projection pursuit regression. |
a |
A vector with the values of α to consider. |
graph |
If graph is TRUE a filled contour plot will appear. |
The function performs tuning of the projection pursuit regression algorithm with compositional predictor variables using the α-transformation.
A list including:
runtime |
The run time of the cross-validation procedure. |
mse |
The mean squared error of prediction for each number of terms. |
opt.nterms |
The number of terms corresponding to the minimum mean squared error of prediction. |
opt.alpha |
The value of α that leads to the optimal performance. |
performance |
The minimum mean squared error of prediction. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76, 817-823. doi: 10.2307/2287576.
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
alfa.pprcomp, pprcomp.tune, compppr.tune
x <- as.matrix(iris[, 2:4])
x <- x / rowSums(x)
y <- iris[, 1]
mod <- alfapprcomp.tune(y, x, a = c(0, 0.5, 1))
α-transformation
This is a cross-validation procedure to decide on the number of principal components when using regression with compositional data (as predictor variables) using the α-transformation.
alfapcr.tune(y, x, model = "gaussian", nfolds = 10, maxk = 50, a = seq(-1, 1, by = 0.1), folds = NULL, ncores = 1, graph = TRUE, col.nu = 15, seed = NULL)
y |
A vector with either continuous, binary or count data. |
x |
A matrix with the predictor variables, the compositional data. Zero values are allowed. |
model |
The type of regression model to fit. The possible values are "gaussian", "binomial" and "poisson". |
nfolds |
The number of folds for the K-fold cross validation, set to 10 by default. |
maxk |
The maximum number of principal components to check. |
a |
A vector with a grid of values of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0, the isometric log-ratio transformation is applied. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
ncores |
How many cores to use. If you have heavy computations or do not want to wait for a long time, more than 1 core (if available) is suggested. It is advisable to use it if you have many observations and/or many variables, otherwise it will slow down the process. |
graph |
If graph is TRUE (default value) a filled contour plot will appear. |
col.nu |
A number parameter for the filled contour plot, taken into account only if graph is TRUE. |
seed |
You can specify your own seed number here or leave it NULL. |
The α-transformation is applied to the compositional data first and the function "pcr.tune" or "glmpcr.tune" is called.
If graph is TRUE a filled contour will appear. A list including:
mspe |
The MSPE where rows correspond to the values of α and columns to the number of principal components. |
best.par |
The best pair of α and number of principal components. |
performance |
The minimum mean squared error of prediction. |
runtime |
The time required by the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2015). Regression analysis with compositional data containing zero values. Chilean Journal of Statistics, 6(2): 47-57. https://arxiv.org/pdf/1508.01913v1.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
Jolliffe I.T. (2002). Principal Component Analysis.
alfa, alfa.profile, alfa.pcr, pcr.tune, glmpcr.tune, glm
library(MASS)
y <- as.vector(fgl[, 1])
x <- as.matrix(fgl[, 2:9])
x <- x / rowSums(x)
mod <- alfapcr.tune(y, x, nfolds = 10, maxk = 50, a = seq(-1, 1, by = 0.1))
Tuning the parameters of the regularised discriminant analysis for Euclidean data.
rda.tune(x, ina, nfolds = 10, gam = seq(0, 1, by = 0.1), del = seq(0, 1, by = 0.1), ncores = 1, folds = NULL, stratified = TRUE, seed = NULL)
x |
A matrix with the data. |
ina |
A group indicator variable for the available data. |
nfolds |
The number of folds in the cross validation. |
gam |
A grid of values for the γ (gam) parameter. |
del |
A grid of values for the δ (del) parameter. |
ncores |
The number of cores to use. If more than 1, parallel computing will take place. It is advisable to use it if you have many observations and/or many variables, otherwise it will slow down the process. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
stratified |
Do you want the folds to be created in a stratified way? TRUE or FALSE. |
seed |
You can specify your own seed number here or leave it NULL. |
Cross validation is performed to select the optimal parameters for the regularised discriminant analysis and also to estimate the rate of accuracy.
The covariance matrix of each group is calculated, and then the pooled covariance matrix. The spherical covariance matrix consists of the average of the pooled variances in its diagonal and zeros in the off-diagonal elements. gam is the weight of the pooled covariance matrix and 1 - gam is the weight of the spherical covariance matrix, Sa = gam * Sp + (1 - gam) * sp. The result is a compromise between LDA and QDA. del is the weight of Sa and 1 - del is the weight of each group's covariance matrix.
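The two-stage blend just described can be sketched directly (the helper name rda_cov is ours, not the package's internal code):

```r
# Sketch of the regularised covariance used by RDA: blend the pooled
# covariance Sp with a spherical matrix sp * I by gam, then blend that
# compromise Sa with each group's own covariance Si by del.
# del = 1, gam = 1 recovers LDA (pooled covariance for every group);
# del = 0 recovers QDA (each group keeps its own covariance).
rda_cov <- function(Si, Sp, gam, del) {
  sp <- mean(diag(Sp))                              # average pooled variance
  Sa <- gam * Sp + (1 - gam) * sp * diag(nrow(Sp))  # LDA/spherical compromise
  del * Sa + (1 - del) * Si                         # weight del on Sa
}
```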
A list including: If graph is TRUE a heatmap of the performances will appear.
per |
An array with the estimated rate of correct classification for every fold. For each of the M matrices, the row values correspond to gam and the columns to the del parameter. |
percent |
A matrix with the mean estimated rates of correct classification. The row values correspond to gam and the columns to the del parameter. |
se |
A matrix with the standard error of the mean estimated rates of correct classification. The row values correspond to gam and the columns to the del parameter. |
result |
The estimated rate of correct classification along with the best gam and del parameters. |
runtime |
The time required by the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman J.H. (1989): Regularized Discriminant Analysis. Journal of the American Statistical Association 84(405): 165–175.
Friedman Jerome, Trevor Hastie and Robert Tibshirani (2009). The elements of statistical learning, 2nd edition. Springer, Berlin.
Tsagris M., Preston S. and Wood A.T.A. (2016). Improved classification for compositional data using the α-transformation. Journal of Classification, 33(2): 243–261.
mod <- rda.tune(as.matrix(iris[, 1:4]), iris[, 5], gam = seq(0, 1, by = 0.2), del = seq(0, 1, by = 0.2))
mod
Tuning the number of principal components in the generalised linear models.
pcr.tune(y, x, nfolds = 10, maxk = 50, folds = NULL, ncores = 1, seed = NULL, graph = TRUE)
glmpcr.tune(y, x, nfolds = 10, maxk = 10, folds = NULL, ncores = 1, seed = NULL, graph = TRUE)
multinompcr.tune(y, x, nfolds = 10, maxk = 10, folds = NULL, ncores = 1, seed = NULL, graph = TRUE)
y |
A real valued vector for "pcr.tune". A real valued vector for the "glmpcr.tune" with either two numbers, 0 and 1 for example, for the binomial regression or with positive discrete numbers for the poisson. For the "multinompcr.tune" a vector or a factor with more than just two values. This is a multinomial regression. |
x |
A matrix with the predictor variables, they have to be continuous. |
nfolds |
The number of folds in the cross validation. |
maxk |
The maximum number of principal components to check. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
ncores |
The number of cores to use. If more than 1, parallel computing will take place. It is advisable to use it if you have many observations and/or many variables, otherwise it will slow down the process. |
seed |
You can specify your own seed number here or leave it NULL. |
graph |
If graph is TRUE a plot of the performance for each fold along the number of principal components will appear. |
Cross validation is performed to select the optimal number of principal components in the GLMs or the multinomial regression. This function is used by alfapcr.tune.
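The idea behind these tuning functions can be sketched in a few lines of base R: compute the principal component scores once, then fit a GLM on the first k score columns. The helper below is a hypothetical illustration, not the package implementation.

```r
# Hypothetical sketch of a principal component GLM: project the predictors
# onto their first k principal components and fit a Poisson GLM on the scores.
pc_glm <- function(y, x, k) {
  scores <- prcomp(x, center = TRUE)$x[, 1:k, drop = FALSE]
  glm(y ~ scores, family = poisson(link = "log"))
}
set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)
y <- rpois(100, 10)
mod <- pc_glm(y, x, k = 2)
length(coef(mod))  # intercept plus 2 component scores
```

Cross validation then simply repeats this fit for k = 1, ..., maxk on each training fold and records the deviance (or accuracy) on the held-out fold.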
If graph is TRUE a plot of the performance versus the number of principal components will appear.
A list including:
msp |
A matrix with the mean deviance of prediction or mean accuracy for every fold. |
mpd |
A vector with the mean deviance of prediction or mean accuracy, each value corresponds to a number of principal components. |
k |
The number of principal components which minimizes the deviance or maximises the accuracy. |
performance |
The optimal performance: MSE for the linear regression, minimum deviance for the GLMs and maximum accuracy for the multinomial regression. |
runtime |
The time required by the cross-validation procedure. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aguilera A.M., Escabias M. and Valderrama M.J. (2006). Using principal components for estimating logistic regression with high-dimensional multicollinear data. Computational Statistics & Data Analysis 50(8): 1905-1924.
Jolliffe I.T. (2002). Principal Component Analysis.
pcr.tune, glm.pcr, alfa.pcr, alfapcr.tune
library(MASS)
x <- as.matrix(fgl[, 2:9])
y <- rpois(214, 10)
glmpcr.tune(y, x, nfolds = 10, maxk = 20, folds = NULL, ncores = 1)
Tuning the value of α in the α-regression
Tuning the value of α in the α-regression.
alfareg.tune(y, x, a = seq(0.1, 1, by = 0.1), nfolds = 10, folds = NULL, nc = 1, seed = NULL, graph = FALSE)
y |
A matrix with compositional data. Zero values are allowed. |
x |
A matrix with the continuous predictor variables or a data frame including categorical predictor variables. |
a |
The value of the power transformation; it has to be between -1 and 1. If zero values are present it has to be greater than 0. If α = 0, the isometric log-ratio transformation is applied. |
nfolds |
The number of folds to split the data. |
folds |
If you have the list with the folds supply it here. You can also leave it NULL and it will create folds. |
nc |
The number of cores to use. If you have a multicore computer it is advisable to use more than 1, as it makes the procedure faster. It is advisable if you have many observations and/or many variables, otherwise it will slow down the process. |
seed |
You can specify your own seed number here or leave it NULL. |
graph |
If graph is TRUE a plot of the performance for each fold along the values of α will appear. |
The α-transformation is applied to the compositional data and the numerical optimisation is performed for the regression, unless α = 1,
where the coefficients are available in closed form.
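The cross-validation criterion can be written down directly. The function below is a minimal sketch of twice the Kullback-Leibler divergence of the observed compositions from the fitted ones (it is not the package's internal code), using the convention that zero observed components contribute nothing.

```r
# Twice the Kullback-Leibler divergence of observed compositions y from
# fitted compositions est; zero observed values contribute 0 * log(0) = 0.
twice_kl <- function(y, est) {
  terms <- ifelse(y > 0, y * log(y / est), 0)
  2 * sum(terms)
}
y   <- matrix(c(0.2, 0.3, 0.5,
                0.1, 0.6, 0.3), ncol = 3, byrow = TRUE)
est <- matrix(c(0.25, 0.25, 0.5,
                0.2,  0.5,  0.3), ncol = 3, byrow = TRUE)
twice_kl(y, est)   # positive; zero only when est equals y
```

The tuning function computes this quantity on each test fold for every candidate value of α and picks the value that minimises the average.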
A plot of the estimated Kullback-Leibler divergences (multiplied by 2) along the values of α (if graph is set to TRUE).
A list including:
runtime |
The runtime required by the cross-validation. |
kula |
A matrix with twice the Kullback-Leibler divergence of the observed from the fitted values. Each row corresponds to a fold and each column to a value of α. |
kl |
A vector with twice the Kullback-Leibler divergence of the observed from the fitted values. Every value corresponds to a value of α. |
opt |
The optimal value of α. |
value |
The minimum value of twice the Kullback-Leibler divergence. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Giorgos Athineou <[email protected]>.
Tsagris M. (2015). Regression analysis with compositional data containing zero values. Chilean Journal of Statistics, 6(2): 47-57. https://arxiv.org/pdf/1508.01913v1.pdf
Tsagris M.T., Preston S. and Wood A.T.A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain. https://arxiv.org/pdf/1106.1451.pdf
library(MASS)
y <- as.matrix(fgl[1:40, 2:4])
y <- y / rowSums(y)
x <- as.vector(fgl[1:40, 1])
mod <- alfareg.tune(y, x, a = seq(0, 1, by = 0.1), nfolds = 5)
Two-sample test of high-dimensional means for compositional data.
hd.meantest2(y1, y2, R = 1)
y1 |
A matrix containing the compositional data of the first group. |
y2 |
A matrix containing the compositional data of the second group. |
R |
If R is 1 no bootstrap calibration is performed and the asymptotic p-value is returned. If R is greater than 1, the bootstrap p-value is returned. |
A two-sample test for high-dimensional mean vectors of compositional data is implemented. See the references for more details.
A vector with the test statistic value and its associated (bootstrap) p-value.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Cao Y., Lin W. and Li H. (2018). Two-sample tests of high-dimensional means for compositional data. Biometrika, 105(1): 115-132.
m <- runif(200, 10, 15)
x1 <- rdiri(100, m)
x2 <- rdiri(100, m)
hd.meantest2(x1, x2)
Unconstrained GLMs with compositional predictor variables.
ulc.glm(y, x, z = NULL, model = "logistic", xnew = NULL, znew = NULL)
y |
A numerical vector containing the response variable values. This is either a binary variable or a vector with counts. |
x |
A matrix with the predictor variables, the compositional data. No zero values are allowed. |
z |
A matrix, data.frame, factor or a vector with some other covariate(s). |
model |
For the ulc.glm(), this can be either "logistic" or "poisson". |
xnew |
A matrix containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
znew |
A matrix, data.frame, factor or a vector with the values of some other covariate(s). If you have no new data, leave this NULL as is by default. |
The function performs the unconstrained log-contrast logistic or Poisson regression model. The logarithm of the
compositional predictor variables is used (hence no zero values are allowed). The response variable
is linked to the log-transformed data without the constraint that the sum of the regression coefficients
equals 0. If you want the regression with the sum-to-zero constraints see lc.glm.
Extra predictor variables, for instance categorical or continuous, are allowed as well.
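The model described above can be mimicked with base R's glm(). This is a sketch of the idea only, not the ulc.glm() implementation: regress the binary response on the log-transformed composition with no restriction on the coefficients.

```r
# Unconstrained log-contrast logistic regression sketch: the coefficients of
# the log-composition are estimated freely, with no sum-to-zero constraint.
set.seed(1)
x <- matrix(rgamma(150 * 3, 2, 1), ncol = 3)
x <- x / rowSums(x)                       # compositional predictors, no zeros
y <- rbinom(150, 1, 0.5)
mod <- glm(y ~ log(x), family = binomial(link = "logit"))
sum(coef(mod)[-1])                        # need not equal 0
```

The constrained counterpart (lc.glm) would instead force the slope coefficients to sum to zero, which makes the fit invariant to the scale of the composition.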
A list including:
devi |
The residual deviance of the logistic or Poisson regression model. |
be |
The unconstrained regression coefficients. Their sum does not equal 0. |
est |
If the arguments "xnew" and "znew" were given these are the predicted or estimated values, otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Lu J., Shi P., and Li H. (2019). Generalized linear models with linear constraints for microbiome compositional data. Biometrics, 75(1): 235–244.
lc.glm, lc.glm2, ulc.glm2, lcglm.aov
y <- rbinom(150, 1, 0.5)
x <- rdiri(150, runif(3, 1, 3))
mod <- ulc.glm(y, x)
Unconstrained linear regression with compositional predictor variables.
ulc.reg(y, x, z = NULL, xnew = NULL, znew = NULL)
y |
A numerical vector containing the response variable values. This must be a continuous variable. |
x |
A matrix with the predictor variables, the compositional data. No zero values are allowed. |
z |
A matrix, data.frame, factor or a vector with some other covariate(s). |
xnew |
A matrix containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
znew |
A matrix, data.frame, factor or a vector with the values of some other covariate(s). If you have no new data, leave this NULL as is by default. |
The function performs the unconstrained log-contrast regression model as opposed to the log-contrast
regression described in Aitchison (2003), pg. 84-85. The logarithm of the compositional predictor variables
is used (hence no zero values are allowed). The response variable is linked to the log-transformed data
without the constraint that the sum of the regression coefficients equals 0. If you want the regression model
with the sum-to-zero constraints see lc.reg.
Extra predictor variables are allowed as well,
for instance categorical or continuous.
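The idea can be sketched with base R's lm() (a sketch only, not the ulc.reg() implementation): ordinary least squares on the log-transformed composition, with no constraint on the coefficients.

```r
# Unconstrained log-contrast linear regression sketch using the iris data,
# with columns 2:4 normalised into a composition.
y <- iris[, 1]
x <- as.matrix(iris[, 2:4])
x <- x / rowSums(x)                # compositional predictors, no zeros
mod <- lm(y ~ log(x))
sum(coef(mod)[-1])                 # unconstrained: need not equal 0
```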
A list including:
be |
The unconstrained regression coefficients. Their sum does not equal 0. |
covbe |
The covariance matrix of the unconstrained regression coefficients. |
va |
The estimated regression variance. |
residuals |
The vector of residuals. |
est |
If the arguments "xnew" and "znew" were given these are the predicted or estimated values, otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
lc.reg, lcreg.aov, lc.reg2, ulc.reg2, alfa.pcr, alfa.knn.reg
y <- iris[, 1]
x <- as.matrix(iris[, 2:4])
x <- x / rowSums(x)
mod1 <- ulc.reg(y, x)
mod2 <- ulc.reg(y, x, z = iris[, 5])
Unconstrained linear regression with multiple compositional predictors.
ulc.reg2(y, x, z = NULL, xnew = NULL, znew = NULL)
y |
A numerical vector containing the response variable values. This must be a continuous variable. |
x |
A list with multiple matrices with the predictor variables, the compositional data. No zero values are allowed. |
z |
A matrix, data.frame, factor or a vector with some other covariate(s). |
xnew |
A list with multiple matrices containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
znew |
A matrix, data.frame, factor or a vector with the values of some other covariate(s). If you have no new data, leave this NULL as is by default. |
The function performs the unconstrained log-contrast regression model as opposed to the log-contrast
regression described in Aitchison (2003), pg. 84-85. The logarithm of the compositional predictor variables
is used (hence no zero values are allowed). The response variable is linked to the log-transformed data
without the constraint that the sum of the regression coefficients equals 0. If you want the
regression model with the sum-to-zero constraints see lc.reg2.
Extra predictor variables
are allowed as well, for instance categorical or continuous. Similarly to lc.reg2,
there
are multiple compositions treated as predictor variables.
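With multiple compositional predictors the same unconstrained idea applies after column-binding the log-transformed matrices. A sketch (not the ulc.reg2() internals), assuming a list of compositional matrices:

```r
# Sketch: combine the log-transformed compositional matrices column-wise and
# run one unconstrained least-squares fit over all of them.
set.seed(1)
y  <- rnorm(150)
x1 <- matrix(rgamma(150 * 3, 2, 1), ncol = 3); x1 <- x1 / rowSums(x1)
x2 <- matrix(rgamma(150 * 4, 2, 1), ncol = 4); x2 <- x2 / rowSums(x2)
X <- do.call(cbind, lapply(list(x1, x2), log))
mod <- lm(y ~ X)
length(coef(mod))   # intercept + 3 + 4 columns
```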
A list including:
be |
The unconstrained regression coefficients. Their sum for each composition does not equal 0. |
covbe |
The covariance matrix of the unconstrained regression coefficients. |
va |
The estimated regression variance. |
residuals |
The vector of residuals. |
est |
If the arguments "xnew" and "znew" were given these are the predicted or estimated values, otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Liu X., Cong X., Li G., Maas K. and Chen K. (2020). Multivariate Log-Contrast Regression with Sub-Compositional Predictors: Testing the Association Between Preterm Infants' Gut Microbiome and Neurobehavioral Outcome.
lc.reg2, ulc.reg, lc.reg, alfa.pcr, alfa.knn.reg
y <- iris[, 1]
x <- list()
x1 <- as.matrix(iris[, 2:4])
x1 <- x1 / rowSums(x1)
x[[ 1 ]] <- x1
x[[ 2 ]] <- rdiri(150, runif(4) )
x[[ 3 ]] <- rdiri(150, runif(5) )
mod <- ulc.reg2(y, x)
Unconstrained logistic or Poisson regression with multiple compositional predictors.
ulc.glm2(y, x, z = NULL, model = "logistic", xnew = NULL, znew = NULL)
y |
A numerical vector containing the response variable values. This is either a binary variable or a vector with counts. |
x |
A list with multiple matrices with the predictor variables, the compositional data. No zero values are allowed. |
z |
A matrix, data.frame, factor or a vector with some other covariate(s). |
model |
This can be either "logistic" or "poisson". |
xnew |
A list with multiple matrices containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
znew |
A matrix, data.frame, factor or a vector with the values of some other covariate(s). If you have no new data, leave this NULL as is by default. |
The function performs the unconstrained log-contrast logistic or Poisson regression model. The logarithm of the
compositional predictor variables is used (hence no zero values are allowed). The response variable
is linked to the log-transformed data without the constraint that the sum of the regression coefficients
equals 0. If you want the regression with the sum-to-zero constraints see lc.glm2.
Extra predictor variables, for instance categorical or continuous, are allowed as well.
A list including:
devi |
The residual deviance of the logistic or Poisson regression model. |
be |
The unconstrained regression coefficients. Their sum does not equal 0. |
est |
If the arguments "xnew" and "znew" were given these are the predicted or estimated values, otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Lu J., Shi P., and Li H. (2019). Generalized linear models with linear constraints for microbiome compositional data. Biometrics, 75(1): 235–244.
y <- rbinom(150, 1, 0.5)
x <- list()
x1 <- as.matrix(iris[, 2:4])
x1 <- x1 / rowSums(x1)
x[[ 1 ]] <- x1
x[[ 2 ]] <- rdiri(150, runif(4) )
x[[ 3 ]] <- rdiri(150, runif(5) )
mod <- ulc.glm2(y, x)
Unconstrained quantile regression with compositional predictor variables.
ulc.rq(y, x, z = NULL, tau = 0.5, xnew = NULL, znew = NULL)
y |
A numerical vector containing the response variable values. |
x |
A matrix with the predictor variables, the compositional data. No zero values are allowed. |
z |
A matrix, data.frame, factor or a vector with some other covariate(s). |
tau |
The quantile to be estimated, a number between 0 and 1. |
xnew |
A matrix containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
znew |
A matrix, data.frame, factor or a vector with the values of some other covariate(s). If you have no new data, leave this NULL as is by default. |
The function performs the unconstrained log-contrast quantile regression model.
The logarithm of the compositional predictor variables is used (hence no zero
values are allowed). The response variable is linked to the log-transformed data
without the constraint that the sum of the regression coefficients equals 0.
If you want the regression with the sum-to-zero constraints see lc.rq.
Extra predictor variables, for instance categorical or continuous, are allowed as well.
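Quantile regression minimises the check (pinball) loss. The base-R sketch below illustrates the unconstrained log-contrast version without the quantreg dependency that ulc.rq() itself relies on; it is a rough stand-in for quantreg::rq(), not the package code.

```r
# Minimise the check loss of the residuals for the median (tau = 0.5),
# with the log-composition entering unconstrained.
check_loss <- function(b, y, X, tau) {
  r <- y - X %*% b
  sum(r * (tau - (r < 0)))
}
set.seed(2)
x <- matrix(rgamma(100 * 3, 2, 1), ncol = 3)
x <- x / rowSums(x)
y <- rnorm(100)
X <- cbind(1, log(x))
fit <- optim(rep(0, ncol(X)), check_loss, y = y, X = X, tau = 0.5)
fit$par             # unconstrained coefficients
```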
A list including:
mod |
The object as returned by the function quantreg::rq(). This is useful for hypothesis testing purposes. |
be |
The unconstrained regression coefficients. Their sum does not equal 0. |
est |
If the arguments "xnew" and "znew" were given these are the predicted or estimated values, otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Koenker R. W. and Bassett G. W. (1978). Regression Quantiles, Econometrica, 46(1): 33–50.
Koenker R. W. and d'Orey V. (1987). Algorithm AS 229: Computing Regression Quantiles. Applied Statistics, 36(3): 383–393.
lc.glm, lc.glm2, ulc.glm2, lcglm.aov
y <- rnorm(150)
x <- rdiri(150, runif(3, 1, 3))
mod <- ulc.rq(y, x)
Unconstrained quantile regression with multiple compositional predictors.
ulc.rq2(y, x, z = NULL, tau = 0.5, xnew = NULL, znew = NULL)
y |
A numerical vector containing the response variable values. |
x |
A list with multiple matrices with the predictor variables, the compositional data. No zero values are allowed. |
z |
A matrix, data.frame, factor or a vector with some other covariate(s). |
tau |
The quantile to be estimated, a number between 0 and 1. |
xnew |
A list with multiple matrices containing the new compositional data whose response is to be predicted. If you have no new data, leave this NULL as is by default. |
znew |
A matrix, data.frame, factor or a vector with the values of some other covariate(s). If you have no new data, leave this NULL as is by default. |
The function performs the unconstrained log-contrast quantile regression model.
The logarithm of the compositional predictor variables is used (hence no zero
values are allowed). The response variable is linked to the log-transformed data
without the constraint that the sum of the regression coefficients
equals 0. If you want the regression with the sum-to-zero constraints see
lc.rq2.
Extra predictor variables are allowed as well, for
instance categorical or continuous.
A list including:
mod |
The object as returned by the function quantreg::rq(). This is useful for hypothesis testing purposes. |
be |
The unconstrained regression coefficients. Their sum does not equal 0. |
est |
If the arguments "xnew" and "znew" were given these are the predicted or estimated values, otherwise it is NULL. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (1986). The statistical analysis of compositional data. Chapman & Hall.
Koenker R. W. and Bassett G. W. (1978). Regression Quantiles, Econometrica, 46(1): 33–50.
Koenker R. W. and d'Orey V. (1987). Algorithm AS 229: Computing Regression Quantiles. Applied Statistics, 36(3): 383–393.
y <- rnorm(150)
x <- list()
x1 <- as.matrix(iris[, 2:4])
x1 <- x1 / rowSums(x1)
x[[ 1 ]] <- x1
x[[ 2 ]] <- rdiri(150, runif(4) )
x[[ 3 ]] <- rdiri(150, runif(5) )
mod <- ulc.rq2(y, x)
Unit-Weibull regression models for proportions.
unitweib.reg(y, x, tau = 0.5)
y |
A numerical vector with proportions. Zeros and ones are allowed. |
x |
A matrix or a data frame with the predictor variables. |
tau |
The quantile to be used for estimation. The default value is 0.5 yielding the median. |
See the reference paper.
A list including:
loglik |
The loglikelihood of the regression model. |
info |
A matrix with all estimated parameters, their standard error, their Wald-statistic and its associated p-value. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Mazucheli J., Menezes A. F. B., Fernandes L. B., de Oliveira R. P. and Ghitany M. E. (2020). The unit-Weibull distribution as an alternative to the Kumaraswamy distribution for the modeling of quantiles conditional on covariates. Journal of Applied Statistics, 47(6): 954–974.
y <- exp( - rweibull(100, 1, 1) ) x <- matrix( rnorm(100 * 2), ncol = 2 ) a <- unitweib.reg(y, x)
y <- exp( - rweibull(100, 1, 1) ) x <- matrix( rnorm(100 * 2), ncol = 2 ) a <- unitweib.reg(y, x)
Zero adjusted Dirichlet regression.
zadr(y, x, con = TRUE, B = 1, ncores = 2, xnew = NULL)
zadr2(y, x, con = TRUE, B = 1, ncores = 2, xnew = NULL)
y |
A matrix with the compositional data (dependent variable). The number of observations (vectors) with no zero values should exceed the number of columns of the predictor variables; otherwise, the initial values cannot be calculated. |
x |
The predictor variable(s); they can be either continuous or categorical or both. |
con |
If this is TRUE (default) then the constant term is estimated, otherwise the model includes no constant term. |
B |
If B is greater than 1, bootstrap estimates of the standard errors are returned; in that case you must also set the number of cores (ncores) in order to run in parallel. |
ncores |
The number of cores to use when B > 1. This is used for the bootstrap; if B = 1, it is not taken into consideration. If this does not work you might need to load the doParallel package yourself. |
xnew |
If you have new data use it, otherwise leave it NULL. |
A zero adjusted Dirichlet regression is fitted. The likelihood consists of two components: the contributions of the non-zero compositional values and the contributions of the compositional vectors with at least one zero value. The second component may have many different sub-categories, one for each pattern of zeros. The function "zadr2()" links the covariates to the alpha parameters of the Dirichlet distribution, i.e. it uses the classical parametrization of the distribution. This means that there is a set of regression parameters for each component.
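The grouping by zero pattern described above can be illustrated in a few lines (an illustration of the idea only, not the zadr() internals):

```r
# Group compositional vectors by their pattern of zeros: each unique pattern
# forms its own sub-category of the likelihood.
y <- matrix(c(0.2, 0.3, 0.5,
              0.0, 0.4, 0.6,
              0.1, 0.0, 0.9,
              0.0, 0.7, 0.3), ncol = 3, byrow = TRUE)
pattern <- apply(y > 0, 1, paste, collapse = "")
split(seq_len(nrow(y)), pattern)   # row indices sharing a zero pattern
```

Rows 2 and 4 share the same pattern (first component zero) and so contribute through the same sub-composition.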
A list including:
runtime |
The time required by the regression. |
loglik |
The value of the log-likelihood. |
phi |
The precision parameter. |
be |
The beta coefficients. |
seb |
The standard error of the beta coefficients. |
sigma |
The covariance matrix of the regression parameters (for the mean vector and the phi parameter). |
est |
The fitted or the predicted values (if xnew is not NULL). |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. and Stewart C. (2018). A Dirichlet regression model for compositional data with zeros. Lobachevskii Journal of Mathematics,39(3): 398–412.
Preprint available from https://arxiv.org/pdf/1410.5011.pdf
zad.est, diri.reg, kl.compreg, ols.compreg, alfa.reg
x <- as.vector(iris[, 4])
y <- as.matrix(iris[, 1:3])
y <- y / rowSums(y)
mod1 <- diri.reg(y, x)
y[ sample(1:450, 15) ] <- 0
mod2 <- zadr(y, x)