Title: | A Collection of Efficient and Extremely Fast R Functions II |
---|---|
Description: | A collection of fast statistical and utility functions for data analysis. Functions for regression, maximum likelihood, column-wise statistics and many more have been included. C++ has been utilized to speed up the functions. References: Tsagris M., Papadakis M. (2018). Taking R to its limits: 70+ tips. PeerJ Preprints 6:e26605v1 <doi:10.7287/peerj.preprints.26605v1>. |
Authors: | Manos Papadakis, Michail Tsagris, Stefanos Fafalios, Marios Dimitriadis and Manos Lasithiotakis. |
Maintainer: | Manos Papadakis <[email protected]> |
License: | GPL (>= 2.0) |
Version: | 0.1.5.2 |
Built: | 2024-11-06 06:42:44 UTC |
Source: | CRAN |
Benchmark - Measure time.
benchmark(...,times,envir=parent.frame(),order=NULL) ## S3 method for class 'benchmark' print(x,...)
benchmark(...,times,envir=parent.frame(),order=NULL) ## S3 method for class 'benchmark' print(x,...)
... |
Expressions to the benchmark function. |
x |
Object of class "benchmark" to print. |
times |
Number of time to measure execution time of the expression. |
envir |
Environment to evaluate the expressions. |
order |
An integer vector to execute the epxressions with this order, otherwise the execution order is random. |
For measuring time we have used C++'s new library "chrono".
The execution time for each expression.
Manos Papadakis
R implementation and documentation: Manos Papadakis [email protected].
benchmark(x <- matrix(runif(10*10),10,10),times=10)
benchmark(x <- matrix(runif(10*10),10,10),times=10)
Lower/upper triangular matrix.
is.lower.tri(x, diag = FALSE) is.upper.tri(x, diag = FALSE)
is.lower.tri(x, diag = FALSE) is.upper.tri(x, diag = FALSE)
x |
A matrix with data. |
diag |
A logical value include the diagonal to the result. |
Check if a matrix is lower or upper triangular. You can also include diagonal to the check.
Manos Papadakis.
R implementation and documentation: Manos Papadakis [email protected].
x <- matrix(runif(10*10),10,10) is.lower.tri(x) is.lower.tri(x,TRUE) is.upper.tri(x) is.upper.tri(x,TRUE)
x <- matrix(runif(10*10),10,10) is.lower.tri(x) is.lower.tri(x,TRUE) is.upper.tri(x) is.upper.tri(x,TRUE)
Column-wise summary statistics with grouping variables.
colGroup(x,ina,method="sum",names=TRUE, std = FALSE)
colGroup(x,ina,method="sum",names=TRUE, std = FALSE)
x |
A matrix with data. |
ina |
A numerical vector specifying the groups. If you have numerical values, do not put zeros, but 1, 2, 3 and so on. The numbers must be consecutive, like 1,2,3,.. Do not put 1, 3, 4 as this will cause C++ to crash. |
method |
One of the: "sum", "min", "max", "median", "var". |
names |
Set the name of the result vector with the unique numbers of group variable. |
std |
A boolean variable specyfying whether you want the variances (FALSE) or the standard deviations (TRUE) of each column. This is taken into account only when method = "var". |
Column wise of grouping variables. You can also include diagonal to the check.
Manos Papadakis.
R implementation and documentation: Manos Papadakis [email protected].
Quantile, colQuantile, rowQuantile
x <- matrix(runif(100 * 5), 100, 5) id <- sample(1:3, 100, TRUE) all.equal( colGroup(x, id), rowsum(x, id) )
x <- matrix(runif(100 * 5), 100, 5) id <- sample(1:3, 100, TRUE) all.equal( colGroup(x, id), rowsum(x, id) )
Merge 2 sorted vectors in 1 sorted vector.
Merge(x,y)
Merge(x,y)
x |
A sorted vector with data. |
y |
A sorted vector with data. |
A sorted vector of the 2 arguments.
Manos Papadakis.
R implementation and documentation: Manos Papadakis [email protected].
x <- 1:10 y <- 1:20 Merge(x,y) x <- y <- NULL
x <- 1:10 y <- 1:20 Merge(x,y) x <- y <- NULL
Split the matrix in lower, upper triangular and diagonal.
lud(x)
lud(x)
x |
A matrix with data. |
A list with 3 fields:
lower |
The lower triangular of argument "x". |
upper |
The upper triangular of argument "x". |
diagonal |
The diagonal elements. |
Manos Papadakis.
R implementation and documentation: Manos Papadakis [email protected].
x <- matrix(runif(10*10),10,10) b<-lud(x)
x <- matrix(runif(10*10),10,10) b<-lud(x)
A collection of Rfast2 functions for data analysis. Note 1: The vast majority of the functions accept matrices only, not data.frames. Note 2: Do not have matrices or vectors with have missing data (i.e NAs). We do no check about them and C++ internally transforms them into zeros (0), so you may get wrong results. Note 3: In general, make sure you give the correct input, in order to get the correct output. We do no checks and this is one of the many reasons we are fast.
Package: | Rfast2 |
Type: | Package |
Version: | 0.1.5.2 |
Date: | 2024-03-10 |
License: | GPL (>= 2.0) |
Maintainer: Manos Papadakis <[email protected]>
Manos Papadakis [email protected], Michail Tsagris [email protected], Stefanos Fafalios <[email protected]>, Marios Dimitriadis <[email protected]>.
Add many single terms to a model.
add.term(y, xinc, xout, devi_0, type = "logistic", logged = FALSE, tol = 1e-07, maxiters = 100, parallel = FALSE)
add.term(y, xinc, xout, devi_0, type = "logistic", logged = FALSE, tol = 1e-07, maxiters = 100, parallel = FALSE)
y |
The response variable. It must be a numerical vector. |
xinc |
The already included indendent variable(s). |
xout |
The independent variables whose conditional association with the response is to be calculated. |
devi_0 |
The deviance for Poisson, logistic, qpoisson, qlogistic and normlog regression or the log-likelihood for the Weibull, spml and multinomial regressions. See the example to understand better. |
type |
The type of regression, "poisson", "logistic", "qpoisson" (quasi Poisson), "qlogistic" (quasi logistic) "normlog" (Gaussian regression with log-link) "weibull", "spml" and "multinom". |
logged |
Should the logarithm of the p-value be returned? TRUE or FALSE. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm when fitting the regression models. |
maxiters |
The maximum number of iterations the Newton-Raphson algorithm will perform. |
parallel |
Should the computations take place in parallel? TRUE or FALSE. |
The function is similar to the built-in function add1. You have already fitted a regression model with some independent variables (xinc). You then add each of the xout variables and test their significance.
A matrix with two columns. The test statistic and its associated (logged) p-value.
Stefanos Fafalios.
R implementation and documentation: Stefanos Fafalios <[email protected]>.
McCullagh, Peter, and John A. Nelder. Generalized linear models. CRC press, USA, 2nd edition, 1989.
Presnell Brett, Morrison Scott P. and Littell Ramon C. (1998). Projected multivariate linear models for directional data. Journal of the American Statistical Association, 93(443): 1068-1077.
bic.regs, logiquant.regs, sp.logiregs
x <- matrix( rnorm(200 * 10), ncol = 10) y <- rpois(200, 10) devi_0 <- deviance( glm(y ~ x[, 1:2], poisson) ) a <- add.term(y, xinc = x[,1:2], xout = x[, 3:10], devi_0 = devi_0, type= "poisson") y <- rbinom(200, 1, 0.5) devi_0 <- deviance( glm(y ~ x[, 1:2], binomial) ) a <- add.term(y, xinc = x[,1:2], xout = x[, 3:10], devi_0 = devi_0, type= "logistic") y <- rbinom(200, 2, 0.5) devi_0 <- Rfast::multinom.reg(y, x[, 1:2])$loglik a <- add.term(y, xinc = x[,1:2], xout = x[, 3:10], devi_0 = devi_0, type= "multinom") y <- rgamma(200, 3, 1) devi_0 <- Rfast::weib.reg(y, x[, 1:2])$loglik a <- add.term(y, xinc = x[,1:2], xout = x[, 3:10], devi_0 = devi_0, type= "weibull")
x <- matrix( rnorm(200 * 10), ncol = 10) y <- rpois(200, 10) devi_0 <- deviance( glm(y ~ x[, 1:2], poisson) ) a <- add.term(y, xinc = x[,1:2], xout = x[, 3:10], devi_0 = devi_0, type= "poisson") y <- rbinom(200, 1, 0.5) devi_0 <- deviance( glm(y ~ x[, 1:2], binomial) ) a <- add.term(y, xinc = x[,1:2], xout = x[, 3:10], devi_0 = devi_0, type= "logistic") y <- rbinom(200, 2, 0.5) devi_0 <- Rfast::multinom.reg(y, x[, 1:2])$loglik a <- add.term(y, xinc = x[,1:2], xout = x[, 3:10], devi_0 = devi_0, type= "multinom") y <- rgamma(200, 3, 1) devi_0 <- Rfast::weib.reg(y, x[, 1:2])$loglik a <- add.term(y, xinc = x[,1:2], xout = x[, 3:10], devi_0 = devi_0, type= "weibull")
Angular Gaussian random values simulation.
riag(n, mu)
riag(n, mu)
n |
The sample size, a numerical value. |
mu |
The mean vector in |
The algorithm uses univariate normal random values and with some mean. The vectors are then scaled to have unit length.
A matrix with the simulated data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Mardia, K. V. and Jupp, P. E. (2000). Directional statistics. Chicester: John Wiley & Sons.
Paine P.J., Preston S.P., Tsagris M and Wood A.T.A. (2018). An Elliptically Symmetric Angular Gaussian Distribution. Statistics and Computing, 28(3):689–697.
colspml.mle, circ.cor1, circ.cors1
x <- riag(20, rnorm(4, 3, 1))
x <- riag(20, rnorm(4, 3, 1))
Analysis of variance for circular data.
hcf.circaov(u, ina) lr.circaov(u, ina) het.circaov(u, ina) embed.circaov(u, ina)
hcf.circaov(u, ina) lr.circaov(u, ina) het.circaov(u, ina) embed.circaov(u, ina)
u |
A numeric vector containing the data that are expressed in rads. |
ina |
A numerical or factor variable indicating the group of each value. |
The high concentration (hcf.circaov), log-likelihood ratio (lr.circaov), embedding approach (embed.circaov) or the non equal concentration parameters approach (het.circaov) is used.
A vector including:
test |
The value of the test statistic. |
p-value |
The p-value of the test. |
kapa |
The concentration parameter based on all the data. If the het.circaov is used this argument is not returned. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Mardia, K. V. and Jupp, P. E. (2000). Directional statistics. Chicester: John Wiley & Sons.
x <- rnorm(60, 2.3, 0.3) ina <- rep(1:3,each = 20) hcf.circaov(x, ina) lr.circaov(x, ina) het.circaov(x, ina) embed.circaov(x, ina)
x <- rnorm(60, 2.3, 0.3) ina <- rep(1:3,each = 20) hcf.circaov(x, ina) lr.circaov(x, ina) het.circaov(x, ina) embed.circaov(x, ina)
backward selection with the F test or the partial correlation coefficient.
lm.bsreg(y, x, alpha = 0.05, type = "F")
lm.bsreg(y, x, alpha = 0.05, type = "F")
y |
The dependent variable, a numerical vector with numbers. |
x |
A numerical matrix with the indendent variables. We add, internally, the first column of ones. |
alpha |
If you want to perform the usual F (or t) test set this equal to "F". For the test based on the partial correlation set this equal to "cor". |
type |
The type of backward selection to be used, "F" stands for F-test, where "cor" stands for partial correlation. |
It performs backward selection with the F test or the partial correlation coefficient. For the linear regression model, the Wald test is equivalent to the partial F test. So, instead of performing many regression models with single term deletions we perform one regression model with all variables and compute their Wald test effectively. Note, that this is true, only if the design matrix "x" contains the vectors of ones and in our case this must be, strictly, the first column. The second option is to compute the p-value of the partial correlation.
A matrix with two columns. The removed variables and their associated pvalue.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Hastie T., Tibshirani R. and Friedman J. (2008). The Elements of Statistical Learning (2nd Ed.), Springer.
lm.drop1, mmpc2, gee.reg, pc.sel
y <- rnorm(150) x <- as.matrix(iris[, 1:4]) a <- lm(y ~., data.frame(x) ) lm.bsreg(y, x)
y <- rnorm(150) x <- as.matrix(iris[, 1:4]) a <- lm(y ~., data.frame(x) ) lm.bsreg(y, x)
BIC of many simple univariate regressions.
bic.regs(y, x, family = "normal")
bic.regs(y, x, family = "normal")
y |
The dependent variable, a numerical vector. |
x |
A matrix with the indendent variables. |
family |
The family of the regression models. "normal", "binomial", "poisson", "multinomial", "normlog" (Gaussian regression with log link), "spmpl" (SPML regression) or "weibull" for Weibull regression. |
Many simple univariate regressions are fitted and the BIC of every model is computed.
A vector with the BIC of each regression model.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
y <- rbinom(100, 1, 0.6) x <- matrix( rnorm(100 * 50), ncol = 50 ) bic.regs(y, x, "binomial")
y <- rbinom(100, 1, 0.6) x <- matrix( rnorm(100 * 50), ncol = 50 ) bic.regs(y, x, "binomial")
Binomial regression.
binom.reg(y, ni, x, full = FALSE, tol = 1e-07, maxiters = 100)
binom.reg(y, ni, x, full = FALSE, tol = 1e-07, maxiters = 100)
y |
The dependent variable; a numerical vector with integer values, 0, 1, 2,... The successes. |
ni |
A vector with integer values, greater than or equal to y. The trials. |
x |
A matrix with the data, where the rows denote the samples (and the two groups) and the columns are the variables. This can be a matrix or a data.frame (with factors). |
full |
If this is FALSE, the coefficients and the deviance will be returned only. If this is TRUE, more information is returned. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The max number of iterations that can take place in each regression. |
The difference from logistic regression is that in the binomial regression the binomial distribution is used and not the Bernoulli.
When full is FALSE a list including:
be |
The regression coefficients. |
devi |
The deviance of the model. |
When full is TRUE a list including:
info |
The regression coefficients, their standard error, their Wald test statistic and their p-value. |
devi |
The deviance. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
McCullagh Peter and John A. Nelder. Generalized linear models. CRC Press, USA, 2nd edition, 1989.
x <- matrix(rnorm(100 * 2), ncol = 2) y <- rbinom(100, 20, 0.5) ## binary logistic regression ni <- rep(20, 100) a <- binom.reg(y, ni, x, full = TRUE) x <- NULL
x <- matrix(rnorm(100 * 2), ncol = 2) y <- rbinom(100, 20, 0.5) ## binary logistic regression ni <- rep(20, 100) a <- binom.reg(y, ni, x, full = TRUE) x <- NULL
Bootstrap James and Hotelling test for 2 independent sample mean vectors.
boot.james(y1, y2, R = 999) boot.hotel2(y1, y2, R = 999)
boot.james(y1, y2, R = 999) boot.hotel2(y1, y2, R = 999)
y1 |
A numerical matrix with the data of the one sample. |
y2 |
A numerical matrix with the data of the other sample. |
R |
The number of bootstrap samples to use. |
We bootstrap the 2-samples James (does not assume equal covariance matrics) and Hotelling test
(assumes equal covariance matrics). The difference is that the Hotelling test statistic assumes
equaility of the covariance matrices, which if violated leads to inlfated type I errors.
Bootstrap calibration though takes care of this issue. As for the bootstrap calibration, instead
of sampling B times from each sample, we sample from each of them and then take all pairs.
Each bootstrap sample is independent of each other, hence there is no violation of the theory
(Chatzipantsiou et al., 2019).
The bootstrap p-value.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
G.S. James (1954). Tests of Linear Hypothese in Univariate and Multivariate Analysis when the Ratios of the Population Variances are Unknown. Biometrika, 41(1/2): 19-43
Efron Bradley and Robert J. Tibshirani (1993). An introduction to the bootstrap. New York: Chapman & Hall/CRC.
Chatzipantsiou C., Dimitriadis M., Papadakis M. and Tsagris M. (2019). Extremely efficient permutation and bootstrap hypothesis tests using R. To appear in the Journal of Modern Applied Statistical Methods.
https://arxiv.org/ftp/arxiv/papers/1806/1806.10947.pdf
boot.james( as.matrix(iris[1:25, 1:4]), as.matrix(iris[26:50, 1:4]) )
boot.james( as.matrix(iris[1:25, 1:4]), as.matrix(iris[26:50, 1:4]) )
Bootstrap Student's t-test for 2 independent samples.
boot.student2(x, y, B = 999)
boot.student2(x, y, B = 999)
x |
A numerical vector with the data. |
y |
A numerical vector with the data. |
B |
The number of bootstrap samples to use. |
We bootstrap Student's (Gosset's) t-test statistic and not the Welch t-test statistic. For the latter case
see the "boot.ttest2" function in Rfast. The difference is that Gosset's test statistic assumes equaility of the
variances, which if violated leads to inlfated type I errors. Bootstrap calibration though takes care of this issue.
As for the bootstrap calibration, instead of sampling B times from each sample, we sample from each
of them and then take all pairs. Each bootstrap sample is independent of each other, hence there is no violation
of the theory (Chatzipantsiou et al., 2019).
A vector with the test statistic and the bootstrap p-value.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Efron Bradley and Robert J. Tibshirani (1993). An introduction to the bootstrap. New York: Chapman & Hall/CRC.
Chatzipantsiou C., Dimitriadis M., Papadakis M. and Tsagris M. (2019). Extremely efficient permutation and bootstrap hypothesis tests using R. To appear in the Journal of Modern Applied Statistical Methods.
https://arxiv.org/ftp/arxiv/papers/1806/1806.10947.pdf
x <- rexp(40, 4) y <- rbeta(50, 2.5, 7.5) t.test(x, y, var.equal = TRUE) boot.student2(x, y, 9999)
x <- rexp(40, 4) y <- rbeta(50, 2.5, 7.5) t.test(x, y, var.equal = TRUE) boot.student2(x, y, 9999)
Censored Weibull regression model.
censweib.reg(y, x, di, tol = 1e-07, maxiters = 100)
censweib.reg(y, x, di, tol = 1e-07, maxiters = 100)
y |
The dependent variable; a numerical vector with strictly positive data, i.e. greater than zero. |
x |
A matrix with the data, where the rows denote the samples (and the two groups) and the columns are the variables. This can be a matrix or a data.frame (with factors). |
di |
A vector with 1s and 0s indicating the censored value. The value of 1 means uncesored value, whereas the value of 0 means censored value. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The max number of iterations that can take place in each regression. |
The function is written in C++ and this is why it is very fast. No standard errors are returned as they are not corectly estimated. We focused on speed.
When full is FALSE a list including:
iters |
The iterations required by the Newton-Raphson. |
loglik |
The log-likelihood of the model. |
shape |
The shape parameter of the Weibull regression. |
be |
The regression coefficients. |
Michail Tsagris and Stefanos Fafalios.
R implementation and documentation: Michail Tsagris [email protected] and Stefanos Fafalios [email protected].
McCullagh, Peter, and John A. Nelder. Generalized linear models. CRC press, USA, 2nd edition, 1989.
censweibull.mle, km, gumbel.reg
## Not run: x <- matrix(rnorm(100 * 2), ncol = 2) y <- rexp(100, 1) di <- rbinom(100, 1, 0.8) mod <- censweib.reg(y, x, di) x <- NULL ## End(Not run)
## Not run: x <- matrix(rnorm(100 * 2), ncol = 2) y <- rexp(100, 1) di <- rbinom(100, 1, 0.8) mod <- censweib.reg(y, x, di) x <- NULL ## End(Not run)
Check whether a square matrix is skew-symmetric.
is.skew.symmetric(x)
is.skew.symmetric(x)
x |
A square matrix with data. |
Instead of going through the whole matrix, the function will stop if the first disagreement is met.
A boolean value, TRUE of FALSE.
Manos Papadakis.
R implementation and documentation: Manos Papadakis [email protected].
x <-matrix( rnorm( 100 * 400), ncol = 400 ) s1 <- cor(x) is.skew.symmetric(s1) x <- x[1:100, ] is.skew.symmetric(x) x<-s1<-NULL
x <-matrix( rnorm( 100 * 400), ncol = 400 ) s1 <- cor(x) is.skew.symmetric(s1) x <- x[1:100, ] is.skew.symmetric(x) x<-s1<-NULL
Circurlar correlations between two circular variables.
circ.cor1(theta, phi, pvalue = FALSE) circ.cors1(theta, phi, pvalue = FALSE)
circ.cor1(theta, phi, pvalue = FALSE) circ.cors1(theta, phi, pvalue = FALSE)
theta |
The first cirular variable expressed in radians, not degrees. |
phi |
The other cirular variable. In the case of "circ.cors1" this is a matrix with many circular variables. In either case, the values must be in radians, not degrees. |
pvalue |
If you want the p-value of the zero correlation hypothesis testing set this to TRUE, otherwise leave it FALSE. |
Correlation for circular variables using the cosinus and sinus formula of Jammaladaka and SenGupta (1988).
If you set pvalue = TRUE, then for the "circ.cor1" a vector with two values, the correlation and its associated p-value, otherwise the correlation only. For the "circ.cors1", either a vector with the correlations only or a matrix with two columns, the correlation and the p-values.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Jammalamadaka, R. S. and Sengupta, A. (2001). Topics in circular statistics. World Scientific.
Jammalamadaka, S. R. and Sarma, Y. R. (1988) . A correlation coefficient for angular variables. Statistical Theory and Data Analysis, 2:349–364.
y <- runif(50, 0, 2 * pi) x <- runif(50, 0, 2 * pi) circ.cor1(y, x, TRUE) x <- matrix(runif(50 * 10, 0, 2 * pi), ncol = 10) circ.cors1(y, x, TRUE)
y <- runif(50, 0, 2 * pi) x <- runif(50, 0, 2 * pi) circ.cor1(y, x, TRUE) x <- matrix(runif(50 * 10, 0, 2 * pi), ncol = 10) circ.cors1(y, x, TRUE)
Cluster robust wild bootstrap for linear models.
wild.boot(y, x, cluster, ind = NULL, R = 999, parallel = FALSE)
wild.boot(y, x, cluster, ind = NULL, R = 999, parallel = FALSE)
y |
The dependent variable, a numerical vector with numbers. |
x |
A matrix or a data.frame with the indendent variables. |
cluster |
A vector indicating the clusters. |
ind |
A vector with the indices of the variables for which wild bootstrap p-values will be computed. If NULL (default value), the p-values are computed for each variable. |
R |
The number of bootstrap replicates to perform. |
parallel |
Do you want the process to take place in parallel? If yes, then set this equal to TRUE. |
A linear regression model for clustered data is fitted. For more information see Chapter 4.21 of Hansen (2019).
A matrix with 5 columns, the estimated coefficients of the linear model, their cluster robust standard error, their cluster robust test statistic, their cluster robust p-value, and their cluster robust wild bootstrap p-value.
Michail Tsagris and Stefanos Fafalios.
R implementation and documentation: Michail Tsagris [email protected] and Stefanos Fafalios [email protected].
Cameron A. Colin, Gelbach J.B., and Miller D.L. (2008). Bootstrap-Based Improvements for Inference with Clustered Errors. The Review of Economics and Statistics 90(3): 414-427.
y <- rnorm(200) id <- sample(1:20, 200, replace = TRUE) x <- matrix( rnorm(200 * 3), ncol = 3 ) wild.boot(y, x, cluster = id)
y <- rnorm(200) id <- sample(1:20, 200, replace = TRUE) x <- matrix( rnorm(200 * 3), ncol = 3 ) wild.boot(y, x, cluster = id)
Column and row-wise jackknife sample means.
coljack.means(x) rowjack.means(x)
coljack.means(x) rowjack.means(x)
x |
A numerical matrix with data. |
An efficient implementation of the jackknife mean is provided.
A vector with the jackknife sample means.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Efron Bradley and Robert J. Tibshirani (1993). An introduction to the bootstrap. New York: Chapman & Hall/CRC.
x <- as.matrix(iris[1:50, 1:4]) coljack.means(x)
x <- as.matrix(iris[1:50, 1:4]) coljack.means(x)
Column-wise means and variances of a matrix.
colmeansvars(x, std = FALSE, parallel = FALSE)
colmeansvars(x, std = FALSE, parallel = FALSE)
x |
A matrix with the data. |
std |
A boolean variable specyfying whether you want the variances (FALSE) or the standard deviations (TRUE) of each column. |
parallel |
A boolean value for parallel version. |
This function cacluates the column-wise means and variances (or standard deviations).
A matrix with two rows. The first contains the means and the second contains the variances (or standard deviations).
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Manos Papadakis [email protected].
colmeansvars( as.matrix(iris[, 1:4]) )
colmeansvars( as.matrix(iris[, 1:4]) )
Column-wise MLE of some univariate distributions.
collognorm.mle(x) collogitnorm.mle(x) colborel.mle(x) colhalfnorm.mle(x) colordinal.mle(x, link = "logit") colcauchy.mle(x, tol = 1e-07, maxiters = 100, parallel = FALSE) colbeta.mle(x, tol = 1e-07, maxiters = 100, parallel = FALSE) colunitweibull.mle(x, tol = 1e-07, maxiters = 100, parallel = FALSE) colpowerlaw.mle(x) colsp.mle(x) colhalfcauchy.mle(x, tol = 1e-07, parallel = FALSE, cores = 0) colcensweibull.mle(x, di, tol = 1e-07, parallel = FALSE, cores = 0) colcenspois.mle(x, tol = 1e-07, parallel = FALSE, cores = 0)
collognorm.mle(x) collogitnorm.mle(x) colborel.mle(x) colhalfnorm.mle(x) colordinal.mle(x, link = "logit") colcauchy.mle(x, tol = 1e-07, maxiters = 100, parallel = FALSE) colbeta.mle(x, tol = 1e-07, maxiters = 100, parallel = FALSE) colunitweibull.mle(x, tol = 1e-07, maxiters = 100, parallel = FALSE) colpowerlaw.mle(x) colsp.mle(x) colhalfcauchy.mle(x, tol = 1e-07, parallel = FALSE, cores = 0) colcensweibull.mle(x, di, tol = 1e-07, parallel = FALSE, cores = 0) colcenspois.mle(x, tol = 1e-07, parallel = FALSE, cores = 0)
x |
A numerical matrix with data. Each column refers to a different vector of observations of the same distribution. The values of for lognormal must be greater than zero, for the logitnormal, beta, unit Weibull and sp they must be numbers between 0 and 1, exluding 0 and 1, whereas for the Borel distribution the x must contain integer values greater than 1. For the halfnormal and powerlaw the numbers must be strictly positive, while for the ordinal this can be a numerical matrix with values 1, 2, 3,..., not zeros. The censored Poisson (colcenspois.mle) requires discrete data (counts). |
di |
A vector of 0s (censored) and 1s (not censored) vales. |
link |
This can either be "logit" or "probit". It is the link function to be used. |
tol |
The tolerance value to terminate the Newton-Fisher algorithm. |
maxiters |
The maximum number of iterations to implement. |
parallel |
Do you want to calculations to take place in parallel? The default value is FALSE |
cores |
In case you set parallel = TRUE, then you need to specify the number of cores. |
For each column, the same distribution is fitted and its parameters and log-likelihood are computed.
A matrix with two or three columns. The first one or the first two contain the parameter(s) of the distribution and the second or third column the relevant log-likelihood. For the colordinal.mle() a list including:
param |
A matrix with the intercepts (threshold coefficients) of the model applied to each column (or variable). |
loglik |
The log-likelihood values. |
Michail Tsagris and Stefanos Fafalios.
R implementation and documentation: Michail Tsagris [email protected] and Stefanos Fafalios [email protected].
N.L. Johnson, S. Kotz and N. Balakrishnan (1994). Continuous Univariate Distributions, Volume 1 (2nd Edition).
N.L. Johnson, S. Kotz and N. Balakrishnan (1970). Distributions in statistics: continuous univariate distributions, Volume 2.
Agresti A. (2002) Categorical Data. Second edition. Wiley.
J. Mazucheli A. F. B. Menezes L. B. Fernandes R. P. de Oliveira & M. E. Ghitany (2020). The unit-Weibull distribution as an alternative to the Kumaraswamy distribution for the modeling of quantiles conditional on covariates. Journal of Applied Statistics, 47(6): 954–974.
censpois.mle, gammapois.mle, powerlaw.mle, unitweibull.mle
x <- matrix( exp( rnorm(500 * 50) ), ncol = 50) a <- collognorm.mle(x) x <- NULL
x <- matrix( exp( rnorm(500 * 50) ), ncol = 50) a <- collognorm.mle(x) x <- NULL
Column-wise MLE of the angular Gaussian distribution.
colspml.mle(x ,tol = 1e-07, maxiters = 100, parallel = FALSE)
colspml.mle(x ,tol = 1e-07, maxiters = 100, parallel = FALSE)
x |
A numerical matrix with data. Each column refers to a different vector of observations of the same distribution. The values of for Lognormal must be greater than zero, for the logitnormal they must by percentages, exluding 0 and 1, whereas for the Borel distribution the x must contain integer values greater than 1. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The maximum number of iterations that can take place in each regression. |
parallel |
Do you want this to be executed in parallel or not. The parallel takes place in C++, and the number of threads is defined by each system's availiable cores. |
For each column, spml.mle function is applied that fits the angular Gaussian distribution estimates its parameters and computes the maximum log-likelihood.
A matrix with four columns. The first two are the mean vector, then the parameter, and the fourth
column contains maximum log-likelihood.
Michail Tsagris and Stefanos Fafalios.
R implementation and documentation: Michail Tsagris [email protected] and Stefanos Fafalios [email protected].
Presnell Brett, Morrison Scott P. and Littell Ramon C. (1998). Projected multivariate linear models for directional data. Journal of the American Statistical Association, 93(443): 1068-1077.
x <- matrix( runif(100 * 10), ncol = 10) a <- colspml.mle(x) x <- NULL
x <- matrix( runif(100 * 10), ncol = 10) a <- colspml.mle(x) x <- NULL
Column-wise pooled variances across groups.
pooled.colVars(x, ina, std = FALSE)
pooled.colVars(x, ina, std = FALSE)
x |
A matrix with the data. |
ina |
A numerical vector specifying the groups. If you have numerical values, do not put zeros, but 1, 2, 3 and so on. |
std |
A boolean variable specyfying whether you want the variances (FALSE) or the standard deviations (TRUE) of each column. |
This function cacluates the pooled variance (or standard deviation) for a range of groups for each column.
A vector with the pooled column variances or standard deviations.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Manos Papadakis [email protected].
pooled.colVars( as.matrix(iris[, 1:4]), as.numeric(iris[, 5]) )
pooled.colVars( as.matrix(iris[, 1:4]), as.numeric(iris[, 5]) )
Column-wise weighted least squares meta analysis.
colwlsmeta(yi, vi)
colwlsmeta(yi, vi)
yi |
A matrix with the observations. |
vi |
A matrix with the variances of the observations. |
The weighted least squares (WLS) meta analysis is performed in a column-wise fashion. This function is suitable for simulation studies, where one can perform multiple WLS meta analyses at once. See references for this.
A vector with many elements. The fixed effects mean estimate, the
estimate, the
, the
, the Q test statistic and it's p-value,
the
estimate and the random effects mean estimate.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Stanley T. D. and Doucouliagos H. (2015). Neither fixed nor random: weighted least squares meta-analysis. Statistics in Medicine, 34(13), 2116-2127.
y <- matrix( rnorm(50* 5), ncol = 5) vi <- matrix( rexp(50* 5), ncol = 5) colwlsmeta(y, vi) wlsmeta(y[, 1], vi[, 1])
y <- matrix( rnorm(50* 5), ncol = 5) vi <- matrix( rexp(50* 5), ncol = 5) colwlsmeta(y, vi) wlsmeta(y[, 1], vi[, 1])
Conditional least-squares estimate for Poisson INAR(1) models.
pinar1(x, unbiased = FALSE) colpinar1(x, unbiased = FALSE)
pinar1(x, unbiased = FALSE) colpinar1(x, unbiased = FALSE)
x |
Either a numerical vector or a matrix, depending on the function. |
unbiased |
If you want the unbiased estimation select TRUE. |
The function computes the constant and slope coefficients of the Poisson Integer Autoregressive of order 1 (Poisson INAR(1)) model using the conditional least-squares method.
For pinar1() a vector with two values, the coefficient (constant)
and the
coefficient (slope). See references for more information.
For the colpinar1() a matrix with two columns, the coefficient
(constant) and the
coefficient (slope) for each variable (column of x).
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
M. Bourguignon and K.L.P. Vasconcellos (2015). Improved estimation for Poisson INAR(1) models. Journal of Statistical Computation and Simulation, 85(12): 2425-2441
x <- rpois(200, 10) pinar1(x)
x <- rpois(200, 10) pinar1(x)
Constrained least squares.
cls(y, x, R, ca)
cls(y, x, R, ca)
y |
The response variables, a numerical vector with observations. |
x |
A matrix with independent variables, the design matrix. |
R |
The R vector that contains the values that will multiply the beta coefficients. See details and examples. |
ca |
The value of the constraint, |
This is described in Chapter 8.2 of Hansen (2019). The idea is to inimise the sum of squares of the residuals under the constraint .
As mentioned above, be careful with the input you give in the x matrix and the R vector.
A list including:
bols |
The OLS (Ordinary Least Squares) beta coefficients. |
bcls |
The CLS (Constrained Least Squares) beta coefficients. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Hansen, B. E. (2022). Econometrics, Princeton University Press.
x <- as.matrix( iris[1:50, 1:4] ) y <- rnorm(50) R <- c(1, 1, 1, 1) cls(y, x, R, 1)
x <- as.matrix( iris[1:50, 1:4] ) y <- rnorm(50) R <- c(1, 1, 1, 1) cls(y, x, R, 1)
Contour plots of some bivariate distributions.
den.contours(x, type = "normal", v = 5)
den.contours(x, type = "normal", v = 5)
x |
A matrix with two columns containing the data. |
type |
The distribution whose contours will appear. This can be "normal", "t" or "mlnorm", standing for the bivariate normal, t and bivariate log-normal. |
v |
The degrees of freedom of the bivariate t distribtuion. |
The contour plot.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
collognorm.mle, halfcauchy.mle
x <- as.matrix(iris[, 1:2]) den.contours(x)
x <- as.matrix(iris[, 1:2]) den.contours(x)
Correlation significance testing using Fisher's z-transformation.
cor_test(y, x, type = "pearson", rho = 0, a = 0.05 )
cor_test(y, x, type = "pearson", rho = 0, a = 0.05 )
y |
A numerical vector. |
x |
A numerical vector. |
type |
The type of correlation you want. "pearson" and "spearman" are the two supported types because their standard error is easily calculated. |
rho |
The value of the hypothesised correlation to be used in the hypothesis testing. |
a |
The significance level used for the confidence intervals. |
The function uses the built-in function "cor" which is very fast, then computes a confidence interval and produces a p-value for the hypothesis test.
A vector with 5 numbers; the correlation, the p-value for the hypothesis test that each of them is
equal to "rho", the test statistic and the lower and upper confidence limits.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- rcauchy(60) y <- rnorm(60) cor_test(y, x)
x <- rcauchy(60) y <- rnorm(60) cor_test(y, x)
Covariance between a variable and a set of variables.
covar(y, x)
covar(y, x)
y |
A numerical vector. |
x |
A numerical matrix. |
The function calculates the covariance between a variable and many others.
A vector with the covariances.
Michail Tsagris and Manos Papadakis.
R implementation and documentation: Michail Tsagris [email protected] and Manos Papadakis [email protected].
y <- rnorm(40) x <- matrix( rnorm(40 * 10), ncol = 10 ) covar(y, x) cov(y, x)
y <- rnorm(40) x <- matrix( rnorm(40 * 10), ncol = 10 ) covar(y, x) cov(y, x)
Cross-validation for the k-NN algorithm for really lage scale data.
bigknn.cv(y, x, k = 5:10, type = "C", folds = NULL, nfolds = 10, stratified = TRUE, seed = FALSE, pred.ret = FALSE)
bigknn.cv(y, x, k = 5:10, type = "C", folds = NULL, nfolds = 10, stratified = TRUE, seed = FALSE, pred.ret = FALSE)
y |
A vector of data. The response variable, which can be either continuous or categorical (factor is acceptable). |
x |
A matrix with the available data, the predictor variables. |
k |
A vector with the possible numbers of nearest neighbours to be considered. |
type |
If your response variable y is numerical data, then this should be "R" (regression). If y is in general categorical set this argument to "C" (classification). |
folds |
A list with the indices of the folds. |
nfolds |
The number of folds to be used. This is taken into consideration only if "folds" is NULL. |
stratified |
Do you want the folds to be selected using stratified random sampling? This preserves the analogy of the samples of each group. Make this TRUE if you wish, but only for the classification. If you have regression (type = "R"), do not put this to TRUE as it will cause problems or return wrong results. |
seed |
If you set this to TRUE, the same folds will be created every time. |
pred.ret |
If you want the predicted values returned set this to TRUE. |
The concept behind k-NN is simple. Suppose we have a matrix with predictor variables and a vector with the response variable (numerical or categorical). When a new vector with observations (predictor variables) is available, its corresponding response value, numerical or categorical, is to be predicted. Instead of using a model, parametric or not, one can use this ad hoc algorithm.
The k smallest distances between the new predictor variables and the existing ones are calculated. In the case of regression, the average, median, or harmonic mean of the corresponding response values of these closest predictor values are calculated. In the case of classification, i.e. categorical response value, a voting rule is applied. The most frequent group (response value) is where the new observation is to be allocated.
This function does the cross-validation procedure to select the optimal k, the optimal number of nearest neighbours. The optimal in terms of some accuracy metric. For the classification it is the percentage of correct classification and for the regression the mean squared error.
This function allows for the Euclidean distance only.
A list including:
preds |
If pred.ret is TRUE the predicted values for each fold are returned as elements in a list. |
crit |
A vector whose length is equal to the number of k and is the accuracy metric for each k.
For the classification case it is the percentage of correct classification. For the regression
case the mean square of prediction error. If you want to compute other metrics of accuracy we suggest
you choose "pred.ret = TRUE" when running the function and then write a simple function to compute
more metrics. See |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman J., Hastie T. and Tibshirani R. (2017). The elements of statistical learning. New York: Springer.
Cover TM and Hart PE (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory. 13(1):21-27.
big.knn, regmlelda.cv, multinomreg.cv
x <- as.matrix(iris[, 1:4]) mod <- bigknn.cv(y = iris[, 5], x = x, k = c(3, 4) )
x <- as.matrix(iris[, 1:4]) mod <- bigknn.cv(y = iris[, 5], x = x, k = c(3, 4) )
Cross-validation for the multinomial regression.
multinomreg.cv(y, x, folds = NULL, nfolds = 10, stratified = TRUE, seed = FALSE, pred.ret = FALSE)
multinomreg.cv(y, x, folds = NULL, nfolds = 10, stratified = TRUE, seed = FALSE, pred.ret = FALSE)
y |
The response variable. A numerical or a factor type vector. |
x |
A matrix or a data.frame with the predictor variables. |
folds |
A list with the indices of the folds. |
nfolds |
The number of folds to be used. This is taken into consideration only if "folds" is NULL. |
stratified |
Do you want the folds to be selected using stratified random sampling? This preserves the analogy of the samples of each group. Make this TRUE if you wish, but only for the classification. If you have regression (type = "R"), do not put this to TRUE as it will cause problems or return wrong results. |
seed |
If you set this to TRUE, the same folds will be created every time. |
pred.ret |
If you want the predicted values returned set this to TRUE. |
A list including:
preds |
If pred.ret is TRUE the predicted values for each fold are returned as elements in a list. |
crit |
A vector whose length is equal to the number of k and is the accuracy metric for each k. For the classification case it is the percentage of correct classification. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman J., Hastie T. and Tibshirani R. (2017). The elements of statistical learning. New York: Springer.
Bohning, D. (1992). Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44(1): 197-200.
bigknn.cv, mle.lda, reg.mle.lda
x <- as.matrix(iris[, 1:2]) mod <- multinomreg.cv(iris[, 5], x)
x <- as.matrix(iris[, 1:2]) mod <- multinomreg.cv(iris[, 5], x)
Cross-validation for the naive Bayes classifiers.
nb.cv(x, ina, type = "gaussian", folds = NULL, nfolds = 10, stratified = TRUE, seed = FALSE, pred.ret = FALSE)
nb.cv(x, ina, type = "gaussian", folds = NULL, nfolds = 10, stratified = TRUE, seed = FALSE, pred.ret = FALSE)
x |
A matrix with the available data, the predictor variables. |
ina |
A vector of data. The response variable, which is categorical (factor is acceptable). |
type |
The type of naive Bayes, "gaussian", "gamma", "weibull", "normlog", "laplace", "cauchy", "logitnorm", "beta", "vm" or "spml", "poisson", "multinom", "geom" or "bernoulli". |
folds |
A list with the indices of the folds. |
nfolds |
The number of folds to be used. This is taken into consideration only if "folds" is NULL. |
stratified |
Do you want the folds to be selected using stratified random sampling? This preserves the analogy of the samples of each group. Make this TRUE if you wish. |
seed |
If you set this to TRUE, the same folds will be created every time. |
pred.ret |
If you want the predicted values returned set this to TRUE. |
A list including:
preds |
If pred.ret is TRUE the predicted values for each fold are returned as elements in a list. |
crit |
A vector whose length is equal to the number of k and is the accuracy metric for each k. For the classification case it is the percentage of correct classification. If you want to compute other metrics of accuracy we suggest you choose "pred.ret = TRUE" when running the function and then write a simple function to compute more metrics. See . |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman J., Hastie T. and Tibshirani R. (2017). The elements of statistical learning. New York: Springer.
weibullnb.pred, weibull.nb, vm.nb, vmnb.pred, mle.lda,
reg.mle.lda, multinom.reg
x <- as.matrix(iris[, 1:4]) mod <- nb.cv(ina = iris[, 5], x = x )
x <- as.matrix(iris[, 1:4]) mod <- nb.cv(ina = iris[, 5], x = x )
Cross-validation for the regularised maximum likelihood linear discriminant analysis.
regmlelda.cv(x, ina, lambda = seq(0, 1, by = 0.1), folds = NULL, nfolds = 10, stratified = TRUE, seed = FALSE, pred.ret = FALSE)
regmlelda.cv(x, ina, lambda = seq(0, 1, by = 0.1), folds = NULL, nfolds = 10, stratified = TRUE, seed = FALSE, pred.ret = FALSE)
x |
A matrix with numerical data. |
ina |
A numerical vector or factor with consecutive numbers indicating the group to which each observation belongs to. |
lambda |
A vector of regularization values |
folds |
A list with the indices of the folds. |
nfolds |
The number of folds to be used. This is taken into consideration only if "folds" is NULL. |
stratified |
Do you want the folds to be selected using stratified random sampling? This preserves the analogy of the samples of each group. Make this TRUE if you wish, but only for the classification. If you have regression (type = "R"), do not put this to TRUE as it will cause problems or return wrong results. |
seed |
If you set this to TRUE, the same folds will be created every time. |
pred.ret |
If you want the predicted values returned set this to TRUE. |
Cross-validation for the regularised maximum likelihood linear discriminant analysis is performed. The function is not extremely fast, yet is pretty fast.
A list including:
preds |
If pred.ret is TRUE the predicted values for each fold are returned as elements in a list. |
crit |
A vector whose length is equal to the number of k and is the accuracy metric for each k. For the classification case it is the percentage of correct classification. For the regression case the mean square of prediction error. If you want to compute other metrics of accuracy we suggest you choose "pred.ret = TRUE" when running the function and then write a simple function to compute more metrics. See . |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman J., Hastie T. and Tibshirani R. (2017). The elements of statistical learning. New York: Springer.
Cover TM and Hart PE (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory. 13(1):21-27.
reg.mle.lda, bigknn.cv, mle.lda, big.knn, weibull.nb
x <- as.matrix(iris[, 1:4]) mod <- regmlelda.cv(x, iris[, 5])
x <- as.matrix(iris[, 1:4]) mod <- regmlelda.cv(x, iris[, 5])
Diagonal values of the Hat matrix.
leverage(x)
leverage(x)
x |
A matrix with independent variables, the design matrix. |
The function returns the diagonal values of the Hat matrix used in linear regression. We did not call it "hatvalues" as R contains a built-in function with such a name.
A vector with the diagonal Hat matrix values, the leverage of each observation.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Hansen, B. E. (2019). Econometrics.
x <- as.matrix( iris[1:50, 1:4] ) a <- leverage(x)
x <- as.matrix( iris[1:50, 1:4] ) a <- leverage(x)
Distance between two covariance matrices.
covdist(s1, s2)
covdist(s1, s2)
s1 |
The firt covariance matrix. |
s2 |
The second covariance matrix. |
A metric for covariance matrices is the title of a paper by Forstner and Moonen (2003). The metric is computed for two non-singular covariance matrices.
The distance between the two covariance matrices.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Forstner W. and Moonen B. (2003). A metric for covariance matrices. In Geodesy-The Challenge of the 3rd Millennium, p. 299-309. Springer.
covlikel, covequal, covar, cor_test
s1 <- cov(iris[1:50, 1:4]) s2 <- cov(iris[51:100, 1:4]) covdist(s1, s2)
s1 <- cov(iris[1:50, 1:4]) s2 <- cov(iris[51:100, 1:4]) covdist(s1, s2)
Distance correlation matrix.
dcora(x)
dcora(x)
x |
A numerical matrix. |
The distance correlation matrix is computed.
A matrix with the pairwise distance correlations between all variables in x.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
G.J. Szekely, M.L. Rizzo and N. K. Bakirov (2007). Measuring and Testing Independence by Correlation of Distances. Annals of Statistics, 35(6):2769-2794.
x <- as.matrix( iris[1:50, 1:4] ) res <- dcora(x)
x <- as.matrix( iris[1:50, 1:4] ) res <- dcora(x)
Empirical and exponential empirical likelihood test for a correlation coefficient.
el.cor.test(y, x, rho, tol = 1e-07) eel.cor.test(y, x, rho, tol = 1e-07)
el.cor.test(y, x, rho, tol = 1e-07) eel.cor.test(y, x, rho, tol = 1e-07)
y |
A numerical vector. |
x |
A numerical vector. |
rho |
The hypothesized value of the true partial correlation. |
tol |
The tolerance vlaue to terminate the Newton-Raphson algorithm. |
The empirical or the exponential empirical likelihood test is performed for the Pearson correlation coefficient.
A list including:
iters |
The number of iterations required by the Newton-Raphson. If no convergence occured this is NULL. |
info |
A vector with three values, the value of |
p |
The probabilities of the EL or of the EEL. If no covnergence occured this is NULL. |
Michail Tsagris
R implementation and documentation: Michail Tsagris [email protected].
Efron B. (1981) Nonparametric standard errors and confidence intervals. Canadian Journal of Statistics, 9(2): 139–158.
Owen A. B. (2001). Empirical likelihood. Chapman and Hall/CRC Press.
el.cor.test( iris[, 1], iris[, 2], 0 )$info eel.cor.test( iris[, 1], iris[, 2], 0 )$info
el.cor.test( iris[, 1], iris[, 2], 0 )$info eel.cor.test( iris[, 1], iris[, 2], 0 )$info
Empirical entropy.
empirical.entropy(x, k = NULL, pretty = FALSE)
empirical.entropy(x, k = NULL, pretty = FALSE)
x |
A numerical vector with continuous values. |
k |
If you want to cut the data into a specific range plug it here, otherwise this decide based upon the Freedman-Diaconis' rule. |
pretty |
Should the breaks be equally space upon the range of x? If yes, let this FALSE. If this is TRUE, the breaks are decided using the base command pretty. |
The function computes the empirical entropy.
The estimated empirical entropy.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
https://en.wikipedia.org/wiki/Entropy_estimation
https://en.wikipedia.org/wiki/Histogram
Freedman David and Diaconis P. (1981). On the histogram as a density estimator: L2 theory. Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete. 57(4): 453-476.
x <- rnorm(100) empirical.entropy(x) empirical.entropy(x, pretty = TRUE)
x <- rnorm(100) empirical.entropy(x) empirical.entropy(x, pretty = TRUE)
Energy based normality test.
normal.etest(x, R = 999)
normal.etest(x, R = 999)
x |
A numerical vector. |
R |
The number of Monte Carlo samples to generate. |
The energy based normality test is performed where the p-value is computed via parametric bootstrap. The function is faster than the original implementation in the R package "energy".
A vector with two values, the test statistic value and the Monte Carlo (parametric bootstrap) based p-value.
Michail Tsagris
R implementation and documentation: Michail Tsagris <[email protected]>.
Szekely G. J. and Rizzo M.L. (2005) A New Test for Multivariate Normality. Journal of Multivariate Analysis, 93(1): 58–80.
x <- rnorm(100) normal.etest(x)
x <- rnorm(100) normal.etest(x)
Fisher's linear discriminant analysis.
fisher.da(xnew, x, ina)
fisher.da(xnew, x, ina)
xnew |
A numerical vector or a matrix with the new observations, continuous data. |
x |
A matrix with numerical data. |
ina |
A numerical vector or factor with consecutive numbers indicating the group to which each observation belongs to. |
Maximum likelihood linear discriminant analysis is performed.
A vector with the predicted group of each observation in "xnew".
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Kanti V. Mardia, John T. Kent and John M. Bibby (1979). Multivariate analysis. Academic Press, London.
mle.lda, reg.mle.lda, big.knn, weibull.nb
x <- as.matrix(iris[, 1:4]) ina <- iris[, 5] a <- fisher.da(x, x, ina)
x <- as.matrix(iris[, 1:4]) ina <- iris[, 5] a <- fisher.da(x, x, ina)
Fixed effects regression.
fe.lmfit(y, x, id)
fe.lmfit(y, x, id)
y |
A numerical vector or a numerical matrix. |
x |
A numerical matrix with the predictor variables. |
id |
A vector with the subject ids. This can be factor or a numerical. |
The function performs fixed effects regression (within estimator) for panel (longitudinal) data. It can also handle unblanced designs. A main difference from the package "plm" is that it returns much fewer information, but much faster.
A list including:
be |
The beta coefficients. |
fe |
The fixed effect deviations. |
residuals |
The residuals of the linear model(s). |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
https://www.econometrics-with-r.org/10-rwpd.html
cluster.lm, gee.reg, fipois.reg, wild.boot
y <- rnorm(100) x <- rnorm(100) id <- rep(1:10, 10) mod <- fe.lmfit(y, x, id)
y <- rnorm(100) x <- rnorm(100) id <- rep(1:10, 10) mod <- fe.lmfit(y, x, id)
Fixed intercepts Poisson regression.
fipois.reg(y, x, id, tol = 1e-07, maxiters = 100)
fipois.reg(y, x, id, tol = 1e-07, maxiters = 100)
y |
The dependent variable, a numerical vector with integer, non negative valued data. |
x |
A matrix with the indendent variables. |
id |
A numerical variable with 1, 2, ... indicating the subject. Unbalanced design is of course welcome. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm.
This is set to |
maxiters |
The maximum number of iterations that can take place during the fitting. |
Fixed intercepts Poisson regression for clustered count data is fitted. According to Demidenko (2013), when the
number of clusters (N) is small and the number of observations per cluster () is relatively large,
say
, one may assume that the intercept
is fixed and unknown
(
).
A list including:
be |
The regression coefficients. |
seb |
The standard errors of the regression coefficients. |
ai |
The estimated fixed intercepts fore ach cluster of observations. |
covbeta |
The covariance matrix of the regression coefficients. |
loglik |
The maximised log-likelihood value. |
iters |
The number of iteration the Newton-Raphson required. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Eugene Demidenko (2013). Mixed Models: Theory and Applications with R, pages 388-389, 2nd Edition. New Jersey: Wiley & Sons (excellent book).
cluster.lm, fe.lmfit, gee.reg, covar, welch.tests
y <- rpois(200, 10) id <- sample(1:10, 200, replace = TRUE) x <- rpois(200, 10) fipois.reg(y, x, id)
y <- rpois(200, 10) id <- sample(1:10, 200, replace = TRUE) x <- rpois(200, 10) fipois.reg(y, x, id)
Forward Backward Early Dropping selection regression.
fbed.reg(y, x, alpha = 0.05, type = "logistic", K = 0, backward = FALSE, parallel = FALSE, tol = 1e-07, maxiters = 100)
fbed.reg(y, x, alpha = 0.05, type = "logistic", K = 0, backward = FALSE, parallel = FALSE, tol = 1e-07, maxiters = 100)
y |
The response variable, a numeric vector. |
x |
A matrix with continuous variables. |
alpha |
The significance threshold value for assessing p-values. Default value is 0.05. |
type |
The available types are: "logistic" (binary logistic regression), "qlogistic" (quasi logistic regression, for binary value or proportions including 0 and 1), "poisson" (Poisson regression), "qpoisson" (quasi Poisson regression), "weibull" (Weibull regression) and "spml" (SPML regression). |
K |
How many times should the process be repeated? The default value is 0. |
backward |
After the Forward Early Dropping phase, the algorithm proceeds witha the usual Backward Selection phase. The default value is set to TRUE. It is advised to perform this step as maybe some variables are false positives, they were wrongly selected. This is rather experimental now and there could be some mistakes in the indices of the selected variables. Do not use it for now. |
parallel |
If you want the algorithm to run in parallel set this TRUE. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The maximum number of iterations Newton-Raphson will perform. |
The algorithm is a variation of the usual forward selection. At every step, the most significant variable enters the selected variables set. In addition, only the significant variables stay and are further examined. The non signifcant ones are dropped. This goes until no variable can enter the set. The user has the option to re-do this step 1 or more times (the argument K). In the end, a backward selection is performed to remove falsely selected variables. Note that you may have specified, for example, K=10, but the maximum value FBED used can be 4 for example.
The "qlogistic" and "qpoisson" proceed with the Wald test and no backward is performed, while for all the other regression types, the log-likelihood ratio test is used and backward phase is available.
If K is a single number a list including: Note, that the "gam" argument must be the same though.
res |
A matrix with the selected variables and their test statistic. |
info |
A matrix with the number of variables and the number of tests performed (or models fitted) at each round (value of K). This refers to the forward phase only. |
runtime |
The runtime required. |
Michail Tsagris and Stefanos Fafalios.
R implementation and documentation: Michail Tsagris [email protected] and Stefanos Fafalios [email protected].
Borboudakis G. and Tsamardinos I. (2019). Forward-backward selection with early dropping. Journal of Machine Learning Research, 20(8): 1-39.
logiquant.regs, bic.regs, gee.reg
#simulate a dataset with continuous data x <- matrix( runif(100 * 50, 1, 100), ncol = 50 ) y <- rnbinom(100, 10, 0.5) a <- fbed.reg(y, x, type = "poisson")
#simulate a dataset with continuous data x <- matrix( runif(100 * 50, 1, 100), ncol = 50 ) y <- rnbinom(100, 10, 0.5) a <- fbed.reg(y, x, type = "poisson")
Fractional polynomial regression with one independent variable.
fp(y, x, aa, di = NULL, type = "normal", full = FALSE, seb = FALSE, tol = 1e-07, maxiters = 100)
fp(y, x, aa, di = NULL, type = "normal", full = FALSE, seb = FALSE, tol = 1e-07, maxiters = 100)
y |
The dependent variable, a numerical vector. |
x |
A vector, the independent variable. |
aa |
A vector with two values indicating the range of the optimal value of |
di |
This is valid only for the Weibull regression. A vector with 1s and 0s indicating the censored value. The value of 1 means uncesored value, whereas the value of 0 means censored value. |
type |
The type of regression model: "normal", "logistic", "poisson", "spml" (SPML regression), "gamma", "normlog", "weibull", "negbin". |
full |
If this is FALSE, the coefficients and the deviance will be returned only. If this is TRUE, more information is returned. |
seb |
Do you want the standard error of the estimates to be returned? TRUE or FALSE. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The max number of iterations that can take place in each regression. |
The independent variable is power transformed and this function searches for the optimal power.
A list including:
a |
The power that yields the optimal fit. |
mod |
The model with the independent variable power transformed. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Royston P. and Altman D. G. (1994). Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 43(3): 429-453.
y <- rnorm(100) x <- abs( rnorm(100) ) mod <- fp(y, x, c(-2, 2) )
y <- rnorm(100) x <- abs( rnorm(100) ) mod <- fp(y, x, c(-2, 2) )
Gamma regression with a log-link.
gammareg(y, x, tol = 1e-07, maxiters = 100)
gammareg(y, x, tol = 1e-07, maxiters = 100)
y |
The dependent variable, a numerical variable with non negative numbers. |
x |
A matrix or data.frame with the indendent variables. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The maximum number of iterations that can take place in the regression. |
The gamma.reg fits a Gamma regression with a log-link. The gamma.con fits a Gamma regression with a log link with the intercept only ( glm(y ~ 1, Gamma(log) ) ).
A list including:
iters |
The number of iterations required by the newton-Raphson. |
deviance |
The deviance value. |
phi |
The dispersion parameter ( |
be |
The regression coefficient(s). |
Stefanos Fafalios and Michail Tsagris.
R implementation and documentation: Stefanos Fafalios [email protected] and Michail Tsagris [email protected].
McCullagh, Peter, and John A. Nelder. Generalized linear models. CRC press, USA, 2nd edition, 1989.
## Not run: y <- rgamma(100, 3, 4) x <- matrix( rnorm(100 * 2), ncol = 2) m1 <- glm(y ~ x, family = Gamma(log) ) m2 <- gammareg(y, x) ## End(Not run)
## Not run: y <- rgamma(100, 3, 4) x <- matrix( rnorm(100 * 2), ncol = 2) m1 <- glm(y ~ x, family = Gamma(log) ) m2 <- gammareg(y, x) ## End(Not run)
GEE Gaussian regression.
gee.reg(y, x, id, tol = 1e-07, maxiters = 100)
gee.reg(y, x, id, tol = 1e-07, maxiters = 100)
y |
The dependent variable, a numerical vector. |
x |
A matrix with the indendent variables. |
id |
A numerical variable with 1, 2, ... indicating the subject. Unbalanced design is of course welcome. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm.
This is set to |
maxiters |
The maximum number of iterations that can take place during the fitting. |
Gaussin GEE regression is fitted.
A list including:
be |
The regression coefficients. |
seb |
The standard errors of the regression coefficients. |
phi |
The |
a |
The |
covbeta |
The covariance matrix of the regression coefficients. |
iters |
The number of iteration the Newton-Raphson required. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Wang M. (2014). Generalized estimating equations in longitudinal data analysis: a review and recent developments. Advances in Statistics, 2014.
Hardin J. W. and Hilbe J. M. (2002). Generalized estimating equations. Chapman and Hall/CRC.
cluster.lm, fe.lmfit, wild.boot, fipois.reg
y <- rnorm(200) id <- sample(1:20, 200, replace = TRUE) x <- rnorm(200, 3) gee.reg(y, x, id)
y <- rnorm(200) id <- sample(1:20, 200, replace = TRUE) x <- rnorm(200, 3) gee.reg(y, x, id)
Gumbel regression.
gumbel.reg(y, x, tol = 1e-07, maxiters = 100)
gumbel.reg(y, x, tol = 1e-07, maxiters = 100)
y |
The dependent variable, a numerical vector with real valued numbers. |
x |
A matrix or a data.frame with the indendent variables. |
tol |
The tolerance value required by the Newton-Raphson to stop. |
maxiters |
The maximum iterations allowed. |
A Gumbel regression model is fitted. the standard errors of the regressions are not returned as we do not compute the full Hessian matrix at each step of the Newton-Raphson.
A list including:
be |
The regression coefficients. |
sigma |
The scale parameter. |
loglik |
The loglikelihood of the regression model. |
iters |
The iterations required by the Newton-Raphson. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
y <- rnorm(100) x <- matrix(rnorm(100 * 3), ncol = 3) mod <- gumbel.reg(y, x)
y <- rnorm(100) x <- matrix(rnorm(100 * 3), ncol = 3) mod <- gumbel.reg(y, x)
Hellinger distance based regression for count data.
hellinger.countreg(y, x, tol = 1e-07, maxiters = 100)
hellinger.countreg(y, x, tol = 1e-07, maxiters = 100)
y |
The dependent variable, a numerical vector with integer valued data, counts. |
x |
A numerical matrix with the indendent variables. We add, internally, the first column of ones. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The max number of iterations that can take place in each regression. |
We minimise the Hellinger distance instead of the ordinarily used divergence, the Kullback-Leibler.
Both of them fall under the -divergence class models and hance this one produces asympottically
normal regression coefficients as well.
A list including:
be |
The regression coefficients. |
seb |
The sandwich standard errors of the coefficients. |
covbe |
The sandwich covariance matrix of the regression coefficients. |
H |
The final Hellinger distance. |
iters |
The number of iterations required by Newton-Raphson. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
y <- rpois(100, 10) x <- iris[1:100, 1] a <- hellinger.countreg(y, x)
y <- rpois(100, 10) x <- iris[1:100, 1] a <- hellinger.countreg(y, x)
Hellinger distance based univariate regression for proportions.
prophelling.reg(y, x, cov = FALSE, tol = 1e-07, maxiters = 100)
prophelling.reg(y, x, cov = FALSE, tol = 1e-07, maxiters = 100)
y |
The dependent variable, a numerical vector with percentages. |
x |
A numerical matrix with the indendent variables. We add, internally, the first column of ones. |
cov |
Should the sandwich covariance matrix and the standard errors be returned? If yes, set this equal to TRUE. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The max number of iterations that can take place in each regression. |
We minimise the Jensen-Shannon divergence instead of the ordinarily used divergence, the Kullback-Leibler.
Both of them fall under the -divergence class models and hance this one produces asympottically
normal regression coefficients as well.
A list including:
be |
The regression coefficients. |
seb |
The sandwich standard errors of the beta coefficients, if the input argument argument was set to TRUE. |
covb |
The sandwich covariance matrix of the beta coefficients, if the input argument argument was set to TRUE. |
js |
The final Jensen-Shannon divergence. |
H |
The final Hellinger distance. |
iters |
The number of iterations required by Newton-Raphson. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris, Michail (2015). A novel, divergence based, regression for compositional data. Proceedings of the 28th Panhellenic Statistics Conference, 15-18/4/2015, Athens, Greece. https://arxiv.org/pdf/1511.07600.pdf
propols.reg, simplex.mle, kumar.mle
y <- rbeta(150, 3, 4) x <- iris a <- prophelling.reg(y, x)
y <- rbeta(150, 3, 4) x <- iris a <- prophelling.reg(y, x)
Heteroscedastic linear models for large scale data.
het.lmfit(x, y, type = 1)
het.lmfit(x, y, type = 1)
x |
The design matrix with the data, where each column refers to a different sample of subjects. You must supply the design matrix, with the column of 1s. This function is the analogue of lm.fit and .lm.fit. |
y |
A numerical vector with the response variable. |
type |
The type of regression to be fit in order to find the weights. The type 1 is described in Wooldridge (2012, page 287), whereas type 2 is described in page Wooldridge (2012, page 287). |
We have simply exploitted R's powerful function and managed to do better than .lm.fit which is a really powerful function as well. This is a bare bones function as it returns only two things, the coefficients and the residuals. .lm.fit returns more and lm.fit even more and finally lm returns too much. The addition is that we allow for estimation of the regression coefficients when heteroscedasticity is present.
A list including:
be |
The beta coefficients. |
residuals |
The residuals of the linear model(s). |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Introductory Econometrics. A modern approach. Mason, South-Western Cengage Learning, 5th Edition.
Draper, N.R. and Smith H. (1988). Applied regression analysis. New York, Wiley, 3rd edition.
covrob.lm, cls, cluster.lm, lm.parboot, cor_test, lm.drop1
x <- cbind(1, matrix( rnorm( 100 * 4), ncol = 4 ) ) y <- rnorm(100) a <- het.lmfit(x, y) x <- NULL
x <- cbind(1, matrix( rnorm( 100 * 4), ncol = 4 ) ) y <- rnorm(100) a <- het.lmfit(x, y) x <- NULL
Hurdle-Poisson regression.
hp.reg(y, x, full = FALSE, tol = 1e-07, maxiters = 100)
hp.reg(y, x, full = FALSE, tol = 1e-07, maxiters = 100)
y |
The dependent variable, a numerical vector with numbers. |
x |
A numerical matrix with the indendent variables. We add, internally, the first column of ones. |
full |
If this is FALSE, the coefficients and the log-likelihood will be returned only. If this is TRUE, more information is returned. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The max number of iterations that can take place in each regression. |
Two regression models are fitted, a binary logistic regression and a zero truncated Poisson regression model.
Depending on whether "full" is TRUE or not different outputs are returned. In general, the regression coefficients, the iterations required by Newton-Raphson and the deviances are returned. If full is TRUE, a matrix with their standard errors and the Wald test statistics is returned as well.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Mullahy J (1986). Specification and Testing of Some Modified Count Data Models. Journal of Econometrics, 33(3): 341–365.
y <- rpois(100, 4) x <- iris[1:100, 1] a <- hp.reg(y, x)
y <- rpois(100, 4) x <- iris[1:100, 1] a <- hp.reg(y, x)
Hypothesis test for equality of a covariance matrix.
covequal(x, sigma, a = 0.05)
covequal(x, sigma, a = 0.05)
x |
A numerical matrix with the data whose covariance matrix will be tested for equality. |
sigma |
The covariance matrix that is to be tested for equality. |
a |
The level of significance, default value is equal to 0.05. |
The likelihood-ratio test is used to test whether the sample covariance matrix from some data is equal to some pre-specifief covariance matrix.
A vector with the test statistic, its p-value, the degrees of freedom and the critical value of the test.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Mardia K. V., Kent J. T. and Bibby J. M. (1979, pg. 126-127). Multivariate Analysis. London: Academic Press.
covlikel, covdist, covar, cor_test
x <- as.matrix(iris[1:50, 1:4]) sigma <- cov(iris[, 1:4]) covequal(x, sigma)
x <- as.matrix(iris[1:50, 1:4]) sigma <- cov(iris[, 1:4]) covequal(x, sigma)
Hypothesis tests for equality of multiple covariance matrices.
covlikel(x, ina, a = 0.05) covmtest(x, ina, a = 0.05)
covlikel(x, ina, a = 0.05) covmtest(x, ina, a = 0.05)
x |
A numerical matrix with the data whose covariance matrices will be tested for equality. |
ina |
A vector with the grouping variable that defines the groups. |
a |
The level of significance, default value is equal to 0.05. |
The likelihood-ratio test and the Box's M-test for testing equality of multiple
covariance matrices. The log-likelihood ratio test is the multivariate generalization
of Bartlett's test of homogeneity of variances. According to Mardia (1979, pg. 140),
it may be argued that if is small, then the log-likelihood ratio test gives
too much weight to the contribution of
. This consideration led Box (1949)
to propose his test statistic.
A vector with the test statistic, its p-value, the degrees of freedom and the critical value of the test.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Aitchison J. (2003, pg. 155). The Statistical Analysis of Compositional Data. New Jersey: (Reprinted by) The Blackburn Press.
Mardia K. V., Kent J. T. and Bibby J. M. (1979, p.g. 140). Multivariate Analysis. London: Academic Press.
covequal, covdist, covar, cor_test
x <- as.matrix(iris[, 1:4]) ina <- iris[, 5] covlikel(x, ina)
x <- as.matrix(iris[, 1:4]) ina <- iris[, 5] covlikel(x, ina)
Performs intersection in the same manner as R's base package intersect works.
Intersect(x, y)
Intersect(x, y)
x , y
|
vectors containing a sequence of items, ideally of the same mode |
The function will discard any duplicated values in the arguments.
The function will return a vector of the same mode as the arguments given. NAs will be removed.
Marios Dimitriadis.
R implementation and documentation: Marios Dimitriadis <[email protected]>.
x <- c(sort(sample(1:20, 9))) y <- c(sort(sample(3, 23, 7))) Intersect(x, y)
x <- c(sort(sample(1:20, 9))) y <- c(sort(sample(3, 23, 7))) Intersect(x, y)
Item difficulty and discrimination.
diffic(x) discrim(x, frac = 1/3)
diffic(x) discrim(x, frac = 1/3)
x |
A numerical matrix with 0s (wrong answer) and 1s (correct answer). |
frac |
A number between 0 and 1 used to calculate the difficulty of each question. |
The difficulty and the discrimination of each question (item) are calculated.
A vector with the item difficulties or item discriminations.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Kaplan E. L. and Meier P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282): 457-481.
x <- matrix(rbinom(100 * 10, 1, 0.7), ncol = 10) diffic(x) discrim(x)
x <- matrix(rbinom(100 * 10, 1, 0.7), ncol = 10) diffic(x) discrim(x)
Jackknife sample mean.
jack.mean(x)
jack.mean(x)
x |
A numerical vector with data. |
An efficient implementation of the jackknife mean is provided.
The jackknife sample mean.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Efron Bradley and Robert J. Tibshirani (1993). An introduction to the bootstrap. New York: Chapman & Hall/CRC.
x <- rnorm(50) jack.mean(x)
x <- rnorm(50) jack.mean(x)
Kaplan-Meier estimate of a survival function.
km(ti, di)
km(ti, di)
ti |
A numerical vector with the survival times. |
di |
A numerical vector indicating the censorings. 0 = censored, 1 = not censored. |
The Kaplan-Meier estimate of the survival function takes place.
A matrix with 4 columns. The non censored times, the number of subjects at risk, the number of events at each time and the estimated survival
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Kaplan E. L. and Meier P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282): 457-481.
y <- rgamma(40, 10, 1) di <- rbinom(40, 1, 0.6) a <- km(y, di)
y <- rgamma(40, 10, 1) di <- rbinom(40, 1, 0.6) a <- km(y, di)
Linear model with sandwich robust covariance estimator.
covrob.lm(y, x)
covrob.lm(y, x)
y |
A numerical vector with the response variable. |
x |
The design matrix with the data, where each column refers to a different sample of subjects. You must supply the design matrix, with the column of 1s. This function is the analogue of lm.fit and .lm.fit. |
The function performs the usual linear regression model but returns robust standard errors using the sandwich covariance estimator.
A list including:
info |
A matrix with the beta coefficients, their robust standard error, their t-test statistic, and their associated p-value. |
robcov |
The sandwich robust covariance matrix. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Introductory Econometrics. A modern approach. Mason, South-Western Cengage Learning, 5th Edition.
het.lmfit, cluster.lm, lm.parboot, cor_test, lm.drop1
x <- matrix( rnorm( 100 * 4), ncol = 4 ) y <- rnorm(100) a <- covrob.lm(y, x) x <- NULL
x <- matrix( rnorm( 100 * 4), ncol = 4 ) y <- rnorm(100) a <- covrob.lm(y, x) x <- NULL
Linear regression with clustered data.
cluster.lm(y, x, id)
cluster.lm(y, x, id)
y |
The dependent variable, a numerical vector with numbers. |
x |
A matrix or a data.frame with the indendent variables. |
id |
A numerical variable with 1, 2, ... indicating the subject. Unbalanced design is of course welcome. |
A linear regression model for clustered data is fitted. For more information see Chapter 4.21 of Hansen (2019).
A list including:
be |
The (beta) regression coefficients. |
becov |
Robust covariance matrix of the regression coefficients. |
seb |
Robust standard errors of the regression coefficients. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Hansen, B. E. (2022). Econometrics.
y <- rnorm(200) id <- sample(1:20, 200, replace = TRUE) x <- rnorm(200, 3) cluster.lm(y, x, id)
y <- rnorm(200) id <- sample(1:20, 200, replace = TRUE) x <- rnorm(200, 3) cluster.lm(y, x, id)
Logistic regression for large scale data.
batch.logistic(y, x, k = 10)
batch.logistic(y, x, k = 10)
y |
The dependent variable, a numerical vector with 0s and 1s. |
x |
A matrix with the continuous indendent variables. |
k |
The number of batches to use (see details). |
The batch logistic regression cuts the data into k distinct batches. Then performs logistic regression on each of these batches and the in end combines the coefficients in a meta-analytic form, using the fixed effects form. Using these coefficients, the deviance of the model is computed for all data. This method is pretty accurate for large scale data, with say millions, or even tens of millions of observations.
A list including:
res |
A two-column matrix with the regression coefficients and their associated standard errors. |
devi |
The deviance of the logistic regression. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
y <- rbinom(1000, 1, 0.5) x <- matrix( rnorm(1000 * 5), ncol = 5 ) ## not a very good approximation since the data are not of large scale batch.logistic(y, x, k = 2)
y <- rbinom(1000, 1, 0.5) x <- matrix( rnorm(1000 * 5), ncol = 5 ) ## not a very good approximation since the data are not of large scale batch.logistic(y, x, k = 2)
Mahalanobis depth.
depth.mahala(x, data)
depth.mahala(x, data)
x |
A numerical vector or matrix whose depth you want to compute. |
data |
A numerical matrix used to compute the depth of x. |
This function computes the Mahalanobis depth of x with respect to data.
A numevrical vector with the Mahalanobis depth for each value of x.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Mahalanobis P. (1936). On the generalized distance in statistics. Proceedings of the National Academy India, 12 49–55.
Liu R.Y. (1992). Data depth and multivariate rank tests. In Dodge Y. (editors), L1-Statistics and Related Methods, 279–294.
x <- as.matrix(iris[1:50, 1:4]) depth.mahala(x, x)
x <- as.matrix(iris[1:50, 1:4]) depth.mahala(x, x)
It performs very many 2 sample student's t-tests.
stud.ttests(x, y = NULL, ina, logged = FALSE, parallel = FALSE)
stud.ttests(x, y = NULL, ina, logged = FALSE, parallel = FALSE)
x |
A matrix with the data, where the rows denote the samples and the columns are the variables. |
y |
A second matrix with the data of the second group. If this is NULL (default value) then the argument ina must be supplied. Notice that when you supply the two matrices the procedure is two times faster. |
ina |
A numerical vector with 1s and 2s indicating the two groups. Be careful, the function is designed to accept only these two numbers. In addition, if your "y" is NULL, you must specify "ina". |
logged |
Should the p-values be returned (FALSE) or their logarithm (TRUE)? |
parallel |
Should parallel implentations take place in C++? The default value is FALSE. |
For the t-tests, the student's t-test (that assumes equal variances) is performed.
A matrix with the test statistic, the degrees of freedom and the p-value (or their logarithm) of each test.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
"Student" William Sealy Gosset (1908). The probable error of a mean. Biometrika. 6(1): 1-25.
## 1000 variables, hence 20 t-tests will be performed x = matrix( rnorm(100 * 20), ncol = 20) ## 100 observations in total ina = rbinom(100, 1, 0.6) + 1 ## independent samples t-test stud.ttests(x, ina = ina) x1 = x[ina == 1, ] x2 = x[ina == 2, ] stud.ttests(x1, x2) x <- NULL
## 1000 variables, hence 20 t-tests will be performed x = matrix( rnorm(100 * 20), ncol = 20) ## 100 observations in total ina = rbinom(100, 1, 0.6) + 1 ## independent samples t-test stud.ttests(x, ina = ina) x1 = x[ina == 1, ] x2 = x[ina == 2, ] stud.ttests(x1, x2) x <- NULL
Many approximate simple logistic regressions.
sp.logiregs(y, x, logged = FALSE)
sp.logiregs(y, x, logged = FALSE)
y |
The dependent variable, a numerical vector with 0s or 1s. |
x |
A matrix with the indendent variables. |
logged |
Should the p-values be returned (FALSE) or their logarithm (TRUE)? |
Many simple approximate logistic regressions are performed and hypothesis testing for the singificance of each coefficient is returned. The code is available in the paper by Sikorska et al. (2013). We simply took the code and made some minor modifications. The explanation and the motivation can be found in their paper. They call it semi-parallel logistic regressions, hence we named the function sp.logiregs.
A two-column matrix with the test statistics (Wald statistic) and their associated p-values (or their logarithm).
Initial author Karolina Sikorska. Modifications by Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Karolina Sikorska, Emmanuel Lesaffre, Patrick FJ Groenen and Paul HC Eilers (2013): 14:166. GWAS on your notebook: fast semi-parallel linear and logistic regression for genome-wide association studies. https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/1471-2105-14-166
y <- rbinom(200, 1, 0.5) x <- matrix( rnorm(200 * 50), ncol = 50 ) a <- sp.logiregs(y, x)
y <- rbinom(200, 1, 0.5) x <- matrix( rnorm(200 * 50), ncol = 50 ) a <- sp.logiregs(y, x)
Many binary classification metrics.
colaccs(group, preds) colsens(group, preds) colspecs(group, preds) colprecs(group, preds) colfscores(group, preds) colfbscores(group, preds, b) colfmis(group, preds)
colaccs(group, preds) colsens(group, preds) colspecs(group, preds) colprecs(group, preds) colfscores(group, preds) colfbscores(group, preds, b) colfmis(group, preds)
group |
A numerical vector with two values, 0 and 1. |
preds |
A numerical matrix with scores, probabilities or any other measure. |
b |
The |
The accuracies, sensitivities, specificities, precisions, F-scores, -scores
and the Fowlkes-Mallows index are calculated column-wise. The colaccs is the
only metric that can be used with a multinomial response as well.
A vector with length equal to the number of columns of the "preds" argument containing the relevant values computed for each column.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
https://en.wikipedia.org/wiki/Sensitivity_and_specificity
https://en.wikipedia.org/wiki/Precision_and_recall
colmses, bernoulli.nb, bigknn.cv
## 20 variables, hence 20 accuracies will be calculated ina <- rbinom(100, 1, 0.6) x <- matrix( rnorm(100 * 20), ncol = 20 ) a <- colaccs(ina, x)
## 20 variables, hence 20 accuracies will be calculated ina <- rbinom(100, 1, 0.6) x <- matrix( rnorm(100 * 20), ncol = 20 ) a <- colaccs(ina, x)
Many Gamma regressions.
gammaregs(y, x, tol = 1e-07, logged = FALSE, parallel = FALSE, maxiters = 100)
gammaregs(y, x, tol = 1e-07, logged = FALSE, parallel = FALSE, maxiters = 100)
y |
The dependent variable, a numerical variable with non negative numbers for the Gamma and inverse Gaussian regressions. For the Gaussian with a log-link zero values are allowed. |
x |
A matrix with the indendent variables. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
logged |
A boolean variable; it will return the logarithm of the pvalue if set to TRUE. |
parallel |
Do you want this to be executed in parallel or not. The parallel takes place in C++, therefore you do not have the option to set the number of cores. |
maxiters |
The maximum number of iterations that can take place in each regression. |
Many simple Gamma regressions with a log-link are fitted.
A matrix with the test statistic values and their relevant (logged) p-values.
Stefanos Fafalios and and Michail Tsagris.
R implementation and documentation: Stefanos Fafalios [email protected] and Michail Tsagris [email protected].
McCullagh, Peter, and John A. Nelder. Generalized linear models. CRC press, USA, 2nd edition, 1989.
Zakariya Yahya Algamal and Intisar Ibrahim Allyas (2017). Prediction of blood lead level in maternal and fetal using generalized linear model. International Journal of Advanced Statistics and Probability, 5(2): 65-69.
## Not run: y <- rgamma(100, 3, 10) x <- matrix( rnorm( 100 * 10), ncol = 10 ) b <- glm(y ~ x[, 1], family = Gamma(log) ) anova(b, test= "F") a <- gammaregs(y, x) x <- NULL ## End(Not run)
## Not run: y <- rgamma(100, 3, 10) x <- matrix( rnorm( 100 * 10), ncol = 10 ) b <- glm(y ~ x[, 1], family = Gamma(log) ) anova(b, test= "F") a <- gammaregs(y, x) x <- NULL ## End(Not run)
Many Jarque-Bera normality tests.
jbtests(x) jbtest(x)
jbtests(x) jbtest(x)
x |
A matrix with the data, where the rows denote the observations and the columns are the variables. In the case of a single sample, then this must be a vector and "jbtest" is to be used. |
The Jarque-Bera univariate normality test is performed for each column (variable) of the matrix x.
A matrix with two columns, or a vector with two elements. Either way, the test statistic value and its associated p-value.
Michail Tsagris
R implementation and documentation: Michail Tsagris <[email protected]>.
Yazici B. and Yolacan S. (2007). A comparison of various tests of normality. Journal of Statistical Computation and Simulation, 77(2): 175–183.
x <- matrix( rnorm(100 * 20), ncol = 20 ) a <- jbtests(x) x <- rnorm(100) jbtest(x)
x <- matrix( rnorm(100 * 20), ncol = 20 ) a <- jbtests(x) x <- rnorm(100) jbtest(x)
any metrics for a continuous response variable.
colmses(y, yhat, parallel = FALSE) colmaes(y, yhat, parallel = FALSE) colpkl(y, yhat, parallel = FALSE) colukl(y, yhat, parallel = FALSE)
colmses(y, yhat, parallel = FALSE) colmaes(y, yhat, parallel = FALSE) colpkl(y, yhat, parallel = FALSE) colukl(y, yhat, parallel = FALSE)
y |
A numerical vector. |
yhat |
A numerical matrix with with the predictions. |
parallel |
If you want parallel computations set this equal to TRUE. |
The mean squared errors, mean absolute errors, and Kullback-Leibler divergence for percentages (colpkl) and non-negative values or discrete values (colukl) are computed.
A vector with length equal to the number of columns of the "yhat" argument containing the relevant values computed for each column.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
colaccs, bigknn.cv, mmpc, pc.sel
## 20 variables, hence 20 MSEs will be calculated y <- rnorm(100, 1, 0.6) yhat <- matrix( rnorm(100 * 20), ncol = 20 ) a <- colmses(y, yhat)
## 20 variables, hence 20 MSEs will be calculated y <- rnorm(100, 1, 0.6) yhat <- matrix( rnorm(100 * 20), ncol = 20 ) a <- colmses(y, yhat)
Many negative binomial regressions.
negbin.regs(y, x, type = 1, tol = 1e-07, logged = FALSE, parallel = FALSE, maxiters = 100)
negbin.regs(y, x, type = 1, tol = 1e-07, logged = FALSE, parallel = FALSE, maxiters = 100)
y |
The dependent variable, a numerical variable with non negative numbers |
x |
A matrix with the indendent variables. |
type |
You can choose which way your prefer. Type 1 is for smal sample sizes, whereas type 2 is for larger ones as is faster. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
logged |
If you want the logarithm of the p-values set this equal to TRUE. |
parallel |
Do you want this to be executed in parallel or not. The parallel takes place in C++, therefore you do not have the option to set the number of cores. |
maxiters |
The maximum number of iterations that can take place in each regression. |
Many simple negative binomial regressions with a log-link are fitted.
A matrix with the test statistic values and their relevant (logged) p-values.
Stefanos Fafalios and and Michail Tsagris.
R implementation and documentation: Stefanos Fafalios [email protected] and Michail Tsagris [email protected].
McCullagh, Peter, and John A. Nelder. Generalized linear models. CRC press, USA, 2nd edition, 1989.
Zakariya Yahya Algamal and Intisar Ibrahim Allyas (2017). Prediction of blood lead level in maternal and fetal using generalized linear model. International Journal of Advanced Statistics and Probability, 5(2): 65-69.
bic.regs, negbin.reg, score.zipregs, ztp.reg
## Not run: y <- rnbinom(100, 10, 0.6) x <- matrix( rnorm( 100 * 200), ncol = 200 ) a <- negbin.regs(y, x) x <- NULL ## End(Not run)
## Not run: y <- rnbinom(100, 10, 0.6) x <- matrix( rnorm( 100 * 200), ncol = 200 ) a <- negbin.regs(y, x) x <- NULL ## End(Not run)
Many score based regressions with muliple response variables and a single predictor variable.
mv.score.glms(y, x, oiko = NULL, logged = FALSE) mv.score.weibregs(y, x, logged = FALSE) mv.score.betaregs(y, x, logged = FALSE) mv.score.gammaregs(y, x, logged = FALSE) mv.score.expregs(y, x, logged = FALSE) mv.score.invgaussregs(y, x, logged = FALSE)
mv.score.glms(y, x, oiko = NULL, logged = FALSE) mv.score.weibregs(y, x, logged = FALSE) mv.score.betaregs(y, x, logged = FALSE) mv.score.gammaregs(y, x, logged = FALSE) mv.score.expregs(y, x, logged = FALSE) mv.score.invgaussregs(y, x, logged = FALSE)
y |
A matrix with either discrete or binary data for the Poisson or binary logistic regression respectively. For the Weibull, gamma, inverse Gaussian and exponential regressions the values of y must be strictly positive data, lifetimes or durations for example. For the beta regression they must be numbers between 0 and 1. |
x |
A vector with continuous data, the predictor variable. |
oiko |
This can be either "poisson" or "binomial". |
logged |
A boolean variable; it will return the logarithm of the pvalue if set to TRUE. |
Instead of maximising the log-likelihood via the Newton-Raphson algorithm in order to perform the hypothesis
testing that we use the score test. This is dramatcially faster as no model needs to be fitted.
The first derivative (score) of the log-likelihood is known and in closed form and under the null hypothesis
the fitted values are all equal to the mean of the response variable y. The variance of the score is also known
in closed form. The test is not the same as the likelihood ratio test. It is size correct nonetheless but it
is a bit less efficient and less powerful. For big sample sizes though (5000 or more) the results are the same.
We have seen via simulation studies is that it is size correct to large sample sizes, at elast a few thousands.
You can try for yourselves and see that even with 500 the results are pretty close. The score test is pretty faster
then the classical log-likelihood ratio test.
A matrix with two columns, the test statistic and its associated p-value. For the Poisson and logistic
regression the p-value is derived via the t distribution, whereas for all other regression models
via the distribution.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M., Alenazi A. and Fafalios S. (2020). Computationally efficient univariate filtering for massive data. Electronic Journal of Applied Statistical Analysis, 13(2):390-412.
Hosmer DW. JR, Lemeshow S. and Sturdivant R.X. (2013). Applied Logistic Regression. New Jersey ,Wiley, 3rd Edition.
Campbell M.J. (2001). Statistics at Square Two: Understand Modern Statistical Applications in Medicine, pg. 112. London, BMJ Books.
McCullagh Peter, and John A. Nelder. Generalized linear models. CRC press, USA, 2nd edition, 1989.
score.zipregs, gammaregs, weib.regs
y <- matrix(rbeta(100 * 10, 2, 3), ncol = 10) x <- rnorm(100) a <- mv.score.betaregs(y, x) y <- NULL
y <- matrix(rbeta(100 * 10, 2, 3), ncol = 10) x <- rnorm(100) a <- mv.score.betaregs(y, x) y <- NULL
Many score based zero inflated Poisson regressions.
score.zipregs(y, x, logged = FALSE )
score.zipregs(y, x, logged = FALSE )
y |
A vector with discrete data, counts. |
x |
A matrix with data, the predictor variables. |
logged |
A boolean variable; it will return the logarithm of the pvalue if set to TRUE. |
Instead of maximising the log-likelihood via the Newton-Raphson algorithm in order to perform the hypothesis testing that we use the score test.
This is dramatcially faster as no model need to be fitted. The first derivative of the log-likelihood is known in closed form and under the null hypothesis the
fitted values are all equal to the mean of the response variable y. The test is not the same as the likelihood ratio test. It is size correct nonetheless but it is
a bit less efficient and less powerful. For big sample sizes though (5000 or more) the results are the same. It is also much faster then the classical likelihood ratio test.
A matrix with two columns, the test statistic and its associated (logged) p-value.
Michail Tsagris..
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M., Alenazi A. and Fafalios S. (2020). Computationally efficient univariate filtering for massive data. Electronic Journal of Applied Statistical Analysis, 13(2):390-412.
Lambert D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1):1-14.
Campbell, M.J. (2001). Statistics at Square Two: Understand Modern Statistical Applications in Medicine, pg. 112. London, BMJ Books.
x <- matrix( rnorm(1000 * 100), ncol = 100 ) y <- rpois(1000, 10) y[1:150] <- 0 a <- score.zipregs(y, x) x <- NULL mean(a < 0.05) ## estimated type I error
x <- matrix( rnorm(1000 * 100), ncol = 100 ) y <- rpois(1000, 10) y[1:150] <- 0 a <- score.zipregs(y, x) x <- NULL mean(a < 0.05) ## estimated type I error
Many simple quantile regressions using logistic regressions.
logiquant.regs(y, x, logged = FALSE)
logiquant.regs(y, x, logged = FALSE)
y |
The dependent variable, a numerical vector. |
x |
A matrix with the indendent variables. |
logged |
Should the p-values be returned (FALSE) or their logarithm (TRUE)? |
Instead of fitting quantile regression models, one for each predictor variable and trying to assess its significance, Redden et al. (2004) proposed a simple singificance test based on logistic regression. Create an indicator variable I where 1 indicates a response value above its median and 0 elsewhere. Since I is binary, perform logistic regression for the predictor and assess its significance using the likelihood ratio test. We perform many logistic regression models since we have many predictors whose univariate association with the response variable we want to test.
A two-column matrix with the test statistics (likelihood ratio test statistic) and their associated p-values (or their logarithm).
Author: Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
David T. Redden, Jose R. Fernandez and David B. Allison (2004). A simple significance test for quantile regression. Statistics in Medicine, 23(16): 2587-2597
y <- rcauchy(100, 3, 2) x <- matrix( rnorm(100 * 50), ncol = 50 ) a <- logiquant.regs(y, x)
y <- rcauchy(100, 3, 2) x <- matrix( rnorm(100 * 50), ncol = 50 ) a <- logiquant.regs(y, x)
Many simple Weibull regressions.
weib.regs(y, x, tol = 1e-07, logged = FALSE, parallel = FALSE, maxiters = 100)
weib.regs(y, x, tol = 1e-07, logged = FALSE, parallel = FALSE, maxiters = 100)
y |
The dependent variable, either a numerical variable with numbers greater than zero. |
x |
A matrix with the indendent variables. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
logged |
A boolean variable; it will return the logarithm of the pvalue if set to TRUE. |
parallel |
Do you want this to be executed in parallel or not. The parallel takes place in C++, and the number of threads is defined by each system's availiable cores. |
maxiters |
The maximum number of iterations that can take place in each regression. |
Many simple weibull regressions are fitted.
A matrix with the test statistic values and their associated (logged) p-values.
Stefanos Fafalios.
R implementation and documentation: Stefanos Fafalios [email protected].
y <- rgamma(100, 3, 4) x <- matrix( rnorm( 100 * 30 ), ncol = 30 ) a <- weib.regs(y, x) x <- NULL
y <- rgamma(100, 3, 4) x <- matrix( rnorm( 100 * 30 ), ncol = 30 ) a <- weib.regs(y, x) x <- NULL
Many Welch tests.
welch.tests(y, x, logged = FALSE, parallel = FALSE)
welch.tests(y, x, logged = FALSE, parallel = FALSE)
y |
The dependent variable, a numerical vector. |
x |
A matrix with the indendent variables. They must be integer valued data starting from 1, not 0 and be consecutive numbers. Instead of a data.frame with factor variables, the user must use a matrix with integers. |
logged |
Should the p-values be returned (FALSE) or their logarithm (TRUE)? |
parallel |
If you want to run the function in parallel set this equal to TRUE. |
For each categorical predictor variable, a Welch test is performed. This is useful in feature selection algorithms, to determine for which variable, the means of the dependent variable differ across the different values.
A two-column matrix with the test statistics (F test statistic) and their associated p-values (or their logarithm).
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
B.L. Welch (1951). On the comparison of several mean values: an alternative approach. Biometrika, 38(3/4), 330-336.
y <- rnorm(200) x <- matrix(rbinom(200 * 50, 2, 0.5), ncol = 50) + 1 a <- welch.tests(y, x)
y <- rnorm(200) x <- matrix(rbinom(200 * 50, 2, 0.5), ncol = 50) + 1 a <- welch.tests(y, x)
Max-Min Parents and Children variable selection algorithm for continuous responses.
mmpc(y, x, max_k = 3, alpha = 0.05, method = "pearson", ini = NULL, hash = FALSE, hashobject = NULL, backward = FALSE)
mmpc(y, x, max_k = 3, alpha = 0.05, method = "pearson", ini = NULL, hash = FALSE, hashobject = NULL, backward = FALSE)
y |
The class variable. Provide a numeric vector. |
x |
The main dataset. Provide a numeric matrix. |
max_k |
The maximum conditioning set to use in the conditional independence test. Provide an integer. The default value set is 3. |
alpha |
Threshold for assessing p-values' significance. Provide a double value, between 0.0 and 1.0. The default value set is 0.05. |
method |
Currently only "pearson" is supported. |
ini |
This argument is used for the avoidance of the univariate associations re-calculations, in the case of them being present. Provide it in the form of a list. |
hash |
Boolean value for the activation of the statistics storage in a hash type object. The default value is false. |
hashobject |
This argument is used for the avoidance of the hash re-calculation, in the case of them being present, similarly to ini argument. Provide it in the form of a hash. Please note that the generated hash object should be used only when the same dataset is re-analyzed, possibly with different values of max_k and alpha. |
backward |
Boolean value for the activation of the backward/symmetry correction phase. This option removes and falsely included variables in the parents and children set of the target variable. It calls the |
The MMPC function implements the MMPC algorithm as presented in "Tsamardinos, Brown and Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm" http://www.dsl-lab.org/supplements/mmhc_paper/paper_online.pdf
The output of the algorithm is an list including:
selected |
The order of the selected variables according to the increasing pvalues. |
hashobject |
The hash object containing the statistics calculated in the current run. |
pvalues |
For each feature included in the dataset, this vector reports the strength of its association with the target in the context of all other variables. Particularly, this vector reports the max p-values found when the association of each variable with the target is tested against different conditional sets. Lower values indicate higher association. |
stats |
The statistics corresponding to the aforementioned pvalues (higher values indicate higher association). |
univ |
This is a list with the univariate associations; the test statistics and their corresponding logged p-values. |
max_k |
The max_k value used in the current execution. |
alpha |
The alpha value used in the current execution. |
n.tests |
If hash = TRUE, the number of tests performed will be returned. If hash != TRUE, the number of univariate associations will be returned. |
runtime |
The time (in seconds) that was needed for the execution of algorithm. |
Marios Dimitriadis.
R implementation and documentation: Marios Dimitriadis <[email protected]>.
Tsagris M. and Tsamardinos I. (2019). Feature selection with the R package MXM. F1000Research 7: 1505
Feature Selection with the R Package MXM: Discovering Statistically Equivalent Feature Subsets, Lagani V. and Athineou G. and Farcomeni A. and Tsagris M. and Tsamardinos I. (2017). Journal of Statistical Software, 80(7).
Tsamardinos, I., Aliferis, C. F. and Statnikov, A. (2003). Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 673-678). ACM.
Brown L. E., Tsamardinos, I. and Aliferis C. F. (2004). A novel algorithm for scalable and accurate Bayesian network learning. Medinfo, 711-715.
Tsamardinos, Brown and Aliferis (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine learning, 65(1), 31-78.
set.seed(123) # Dataset with continuous data ds <- matrix(runif(100 * 500, 1, 100), ncol = 500) # Class variable tar <- 3 * ds[, 10] + 2 * ds[, 100] + 3 * ds[, 20] + rnorm(100, 0, 5) mmpc(tar, ds, max_k = 3, alpha = 0.05, method = "pearson")
set.seed(123) # Dataset with continuous data ds <- matrix(runif(100 * 500, 1, 100), ncol = 500) # Class variable tar <- 3 * ds[, 10] + 2 * ds[, 100] + 3 * ds[, 20] + rnorm(100, 0, 5) mmpc(tar, ds, max_k = 3, alpha = 0.05, method = "pearson")
Max-Min Parents and Children variable selection algorithm for non continuous responses.
mmpc2(y, x, max_k = 3, threshold = 0.05, test = "logistic", init = NULL, tol = 1e-07, backward = FALSE, maxiters = 100, parallel = FALSE)
mmpc2(y, x, max_k = 3, threshold = 0.05, test = "logistic", init = NULL, tol = 1e-07, backward = FALSE, maxiters = 100, parallel = FALSE)
y |
The response variable, a numeric vector with either count data or binary data. |
x |
A numerical matrix with the independent (predictor) variables. |
max_k |
The maximum conditioning set to use in the conditional indepedence test (see Details). Integer, default value is 3. |
threshold |
Threshold (suitable values in (0, 1)) for assessing p-values significance. Default value is 0.05. |
test |
One of the following: "logistic", "poisson", "qpoisson". |
init |
A numeric vector with the logged p-values of the univariate associations. Do not use this at the moment. |
tol |
The tolerance value to stop the Newtn-Raphson algorithm inside a regression model. |
backward |
If TRUE, the backward (or symmetry correction) phase will be implemented. This removes any falsely included variables in the parents and children set of the target variable. It calls the |
maxiters |
The maximum number of iterations a Newtn-Raphson algorithm will perform inside a regression model. |
parallel |
Do you want the computations to take part in parallel? Set TRUE if yes. |
MMPC tests each feature for inclusion (selection). It is a variant of the forward selection procedure. a) at every step it removes the non significant variables and does not check thema again. b) Instead of testing a candidate variable conditioning on all previously selected variables, it uses subsets of the previously selected variables. All possible subsets of maximum size equal to max_k. With the approrpiate pre-computations, at every step, it performs only the tests that were not exeucyted before, so it is not that time consuming.
The output of the algorithm is an S3 object including:
selectedVars |
The selected variables, i.e., the signature of the target variable. |
pvalues |
For each feature included in the dataset, this vector reports the strength of its association with the target in the context of all other variable. Particularly, this vector reports the max p-values found when the association of each variable with the target is tested against different conditional sets. Lower values indicate higher association. |
univ |
A vector with the logged p-values of the univariate associations. This vector is very important for subsequent runs of MMPC with different hyper-parameters. After running SES with some hyper-parameters you might want to run MMPCagain with different hyper-parameters. To avoid calculating the univariate associations (first step) again, you can take this list from the first run of SES and plug it in the argument "ini" in the next run(s) of MMPC. This can speed up the second run (and subequent runs of course) by 50%. See the argument "univ" in the output values. |
kapa_pval |
A list with the same number of elements as the max$_k$. Every element in the list is a matrix. The first column is the logged p-values, the second column is the variable whose conditional association with the response variable was tested and the other columns are the conditioning variables. |
max_k |
The max_k option used in the current run. |
threshold |
The threshold option used in the current run. |
n.tests |
The number of tests performed by MMPC will be returned. |
runtime |
The run time of the algorithm. A numeric vector. The first element is the user time, the second element is the system time and the third element is the elapsed time. |
Manos Papadakis.
R implementation and documentation: Manos Papadakis [email protected].
Tsagris M. and Tsamardinos I. (2019). Feature selection with the R package MXM. F1000Research 7: 1505
Feature Selection with the R Package MXM: Discovering Statistically Equivalent Feature Subsets, Lagani, V. and Athineou, G. and Farcomeni, A. and Tsagris, M. and Tsamardinos, I. (2017). Journal of Statistical Software, 80(7).
Tsamardinos I., Aliferis C. F. and Statnikov, A. (2003). Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 673-678). ACM.
Brown L. E., Tsamardinos I. and Aliferis, C. F. (2004). A novel algorithm for scalable and accurate Bayesian network learning. Medinfo, 711-715.
y <- rbinom(100, 1, 0.5) x <- matrix( rnorm(100 * 500), ncol = 500 ) m1 <- mmpc2(y, x, max_k = 3, threshold = 0.05, test = "logistic") m2 <- fbed.reg(y, x, type = "logistic")
y <- rbinom(100, 1, 0.5) x <- matrix( rnorm(100 * 500), ncol = 500 ) m1 <- mmpc2(y, x, max_k = 3, threshold = 0.05, test = "logistic") m2 <- fbed.reg(y, x, type = "logistic")
Maximum likelihood linear discriminant analysis.
mle.lda(xnew, x, ina)
mle.lda(xnew, x, ina)
xnew |
A numerical vector or a matrix with the new observations, continuous data. |
x |
A matrix with numerical data. |
ina |
A numerical vector or factor with consecutive numbers indicating the group to which each observation belongs to. |
Maximum likelihood linear discriminant analysis is performed.
A vector with the predicted group of each observation in "xnew".
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Kanti V. Mardia, John T. Kent and John M. Bibby (1979). Multivariate analysis. Academic Press, London.
fisher.da, reg.mle.lda, big.knn, weibull.nb
x <- as.matrix(iris[, 1:4]) ina <- iris[, 5] a <- mle.lda(x, x, ina)
x <- as.matrix(iris[, 1:4]) ina <- iris[, 5] a <- mle.lda(x, x, ina)
MLE of continuous univariate distributions defined on the positive line.
halfcauchy.mle(x, tol = 1e-07) powerlaw.mle(x)
halfcauchy.mle(x, tol = 1e-07) powerlaw.mle(x)
x |
A vector with positive valued data (zeros are not allowed). |
tol |
The tolerance level up to which the maximisation stops; set to 1e-09 by default. |
Instead of maximising the log-likelihood via a numerical optimiser we have used a Newton-Raphson algorithm which is faster. See wikipedia for the equations to be solved. For the power law we assume that the minimum value of x is above zero in order to perform the maximum likelihood estimation in the usual way.
Usually a list with three elements, but this is not for all cases.
iters |
The number of iterations required for the Newton-Raphson to converge. |
loglik |
The value of the maximised log-likelihood. |
scale |
The scale parameter of the half Cauchy distribution. |
alpha |
The value of the power parameter for the power law distribution. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
N.L. Johnson, S. Kotz and N. Balakrishnan (1994). Continuous Univariate Distributions, Volume 1 (2nd Edition).
N.L. Johnson, S. Kotz and N. Balakrishnan (1970). Distributions in statistics: continuous univariate distributions, Volume 2
You can also check the relevant wikipedia pages for these distributions.
x <- abs( rcauchy(1000, 0, 2) ) halfcauchy.mle(x)
x <- abs( rcauchy(1000, 0, 2) ) halfcauchy.mle(x)
MLE of distributions defined for proportions.
kumar.mle(x, tol = 1e-07, maxiters = 50) simplex.mle(x, tol = 1e-07) zil.mle(x) unitweibull.mle(x, tol = 1e-07, maxiters = 100) cbern.mle(x, tol = 1e-6) sp.mle(x)
kumar.mle(x, tol = 1e-07, maxiters = 50) simplex.mle(x, tol = 1e-07) zil.mle(x) unitweibull.mle(x, tol = 1e-07, maxiters = 100) cbern.mle(x, tol = 1e-6) sp.mle(x)
x |
A vector with proportions or percentages. Zeros are allowed only for the zero inflated logistirc normal distribution (zil.mle). |
tol |
The tolerance level up to which the maximisation stops. |
maxiters |
The maximum number of iterations the Newton-Raphson will perform. |
The distributions included are the Kumaraswamy, zero inflated logistic normal, simplex, unit Weibull and continuous Bernoulli and standard power. Instead of maximising the log-likelihood via a numerical optimiser we have used a Newton-Raphson algorithm which is faster. See wikipedia for the equations to be solved.
Usually a list with three elements, but this is not for all cases.
iters |
The number of iterations required for the Newton-Raphson to converge. |
param |
The two parameters (shape and scale) of the Kumaraswamy distribution. For the zero inflated logistic normal, the probability of non zeros, the mean and the unbiased variance. |
loglik |
The value of the maximised log-likelihood. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Kumaraswamy P. (1980). A generalized probability density function for double-bounded random processes. Journal of Hydrology 46(1-2): 79-88.
Jones M.C. (2009). Kumaraswamy's distribution: A beta-type distribution with some tractability advantages. Statistical Methodology, 6(1): 70-81.
Mazucheli J., Menezes A.F.B., Fernandes L.B., de Oliveira R.P. and Ghitany M.E. (2020). The unit-Weibull distribution as an alternative to the Kumaraswamy distribution for the modeling of quantiles conditional on covariates. Journal of Applied Statistics, DOI:10.1080/02664763.2019.1657813
Leemis L.M. and McQueston J.T. (2008). Univariate Distribution Relationships. The American Statistician, 62(1): 45-53.
You can also check the relevant wikipedia pages.
u <- runif(1000) a <- 0.4 ; b <- 1 x <- ( 1 - (1 - u)^(1/b) )^(1/a) kumar.mle(x)
u <- runif(1000) a <- 0.4 ; b <- 1 x <- ( 1 - (1 - u)^(1/b) )^(1/a) kumar.mle(x)
MLE of some circular distributions with multiple samples.
multivm.mle(x, ina, tol = 1e-07, ell = FALSE) multispml.mle(x, ina, tol = 1e-07, ell = FALSE)
multivm.mle(x, ina, tol = 1e-07, ell = FALSE) multispml.mle(x, ina, tol = 1e-07, ell = FALSE)
x |
A numerical vector with the circular data. They must be expressed in radians. For the "spml.mle" this can also be a matrix with two columns, the cosinus and the sinus of the circular data. |
ina |
A numerical vector with discrete numbers starting from 1, i.e. 1, 2, 3, 4,... or a factor variable. Each number denotes a sample or group. If you supply a continuous valued vector the function will obviously provide wrong results. |
tol |
The tolerance level to stop the iterative process of finding the MLEs. |
ell |
Do you want the log-likelihood returned? The default value is FALSE. |
The parameters of the von Mises and of the bivariate angular Gaussian distributions are estimated for multiple samples.
A list including:
iters |
The iterations required until convergence. This is returned in the wrapped Cauchy distribution only. |
loglik |
A vector with the value of the maximised log-likelihood for each sample. |
mi |
For the von Mises, this is a vector with the means of each sample. For the angular Gaussian (spml), a matrix with the mean vector of each sample |
ki |
A vector with the concentration parameter of the von Mises distribution at each sample. |
gi |
A vector with the norm of the mean vector of the angular Gaussian distribution at each sample. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Mardia K. V. and Jupp P. E. (2000). Directional statistics. Chicester: John Wiley & Sons.
Sra S. (2012). A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of Is(x). Computational Statistics, 27(1): 177-190.
Presnell Brett, Morrison Scott P. and Littell Ramon C. (1998). Projected multivariate linear models for directional data. Journal of the American Statistical Association, 93(443): 1068-1077.
Kent J. and Tyler D. (1988). Maximum likelihood estimation for the wrapped Cauchy distribution. Journal of Applied Statistics, 15(2): 247–254.
y <- rcauchy(100, 3, 1) x <- y ina <- rep(1:2, 50) multivm.mle(x, ina) multispml.mle(x, ina)
y <- rcauchy(100, 3, 1) x <- y ina <- rep(1:2, 50) multivm.mle(x, ina) multispml.mle(x, ina)
MLE of some truncated distributions.
trunccauchy.mle(x, a, b, tol = 1e-07) truncexpmle(x, b, tol = 1e-07)
trunccauchy.mle(x, a, b, tol = 1e-07) truncexpmle(x, b, tol = 1e-07)
x |
A numerical vector with continuous data. For the Cauchy distribnution, they can be anywhere on the real line. For the exponential distribution they must be strictly positive. |
a |
The lower value at which the Cauchy distribution is truncated. |
b |
The upper value at which the Cauchy or the exponential distribution is truncated. For the exponential this must be greater than zero. |
tol |
The tolerance value to terminate the fitting algorithm. |
Maximum likelihood of some truncated distributions is performed.
A list including:
iters |
The number of iterations reuired by the Newton-Raphson algorithm. |
loglik |
The log-likelihood. |
lambda |
The |
param |
The location and scale parameters in the Cauchy distribution. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
David Olive (2018). Applied Robust Statistics (Chapter 4).
http://lagrange.math.siu.edu/Olive/ol-bookp.htm
x <- rnorm(500) trunccauchy.mle(x, -1, 1)
x <- rnorm(500) trunccauchy.mle(x, -1, 1)
MLE of the Cauchy and generalised normal distributions with zero location.
cauchy0.mle(x, tol = 1e-07) gnormal0.mle(x, tol = 1e-06)
cauchy0.mle(x, tol = 1e-07) gnormal0.mle(x, tol = 1e-06)
x |
A numerical vector with positive real numbers. |
tol |
The tolerance level up to which the maximisation stops set to 1e-07 by default. |
The Cauchy is the t distribution with 1 degree of freedom. The cauchy0.mle estimates the usual Cauchy distribution, over the real line, but assumes a zero location. The gnormal0.mle estimates the generalised normal distribution assuming a zero location. The generalised normal distribution is also known as the exponential power distribution or the generalized error distribution.
A list including:
iters |
The number of iterations required by the Newton-Raphson algorithm. |
loglik |
The value of the maximised log-likelihood. |
scale |
The estimated scale parameter of the Cauchy distribution. |
param |
The estimated scale and shape parameters of the generalised normal distribution. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Do M.N. and Vetterli M. (2002). Wavelet-based Texture Retrieval Using Generalised Gaussian Density and Kullback-Leibler Distance. Transaction on Image Processing. 11(2): 146-158.
x <- rcauchy(150, 0, 2) cauchy0.mle(x) x <- rnorm(200) gnormal0.mle(x)
x <- rcauchy(150, 0, 2) cauchy0.mle(x) x <- rnorm(200) gnormal0.mle(x)
MLE of the censored Weibull distribution.
censweibull.mle(x, di, tol = 1e-07)
censweibull.mle(x, di, tol = 1e-07)
x |
A vector with positive valued data (zeros are not allowed). |
di |
A vector of 0s (censored) and 1s (not censored) vales. |
tol |
The tolerance level up to which the maximisation stops; set to 1e-07 by default. |
Instead of maximising the log-likelihood via a numerical optimiser we have used a Newton-Raphson algorithm which is faster.
A list including:
iters |
The number of iterations required for the Newton-Raphson to converge. |
loglik |
The value of the maximised log-likelihood. |
param |
The vector of the parameters. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Fritz Scholz (1996). Maximum Likelihood Estimation for Type I Censored Weibull Data Including Covariates. Technical report. ISSTECH-96-022, Boeing Information & Support Services, P.O. Box 24346, MS-7L-22.
http://faculty.washington.edu/fscholz/Reports/weibcensmle.pdf
x <- rweibull(300, 3, 6) censweibull.mle(x, di = rep(1, 300)) di <- rbinom(300, 1, 0.9) censweibull.mle(x, di)
x <- rweibull(300, 3, 6) censweibull.mle(x, di = rep(1, 300)) di <- rbinom(300, 1, 0.9) censweibull.mle(x, di)
MLE of the gamma-Poisson distribution.
gammapois.mle(x, tol = 1e-07)
gammapois.mle(x, tol = 1e-07)
x |
A numerical vector with positive data and zeros. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
MLE of the gamma-Poisson distribution is fitted. When the rate in the Poisson follows a gamma
distribution with shape = r and scale , the resulting distribution is the gamm-Poisson.
If the shape r is integer, the distribution is called negative binomial distribution.
A list including:
iters |
The iterations required by the Newton-Raphson to estimate the parameters of the distribution for the non zero data. |
loglik |
The full log-likelihood of the model. |
param |
The parameters of the model. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Johnson Norman L., Kotz Samuel and Kemp Adrienne W. (1992). Univariate Discrete Distributions (2nd ed.). Wiley.
x <- rnbinom(200, 20, 0.7) gammapois.mle(x)
x <- rnbinom(200, 20, 0.7) gammapois.mle(x)
MLE of the left censored Poisson distribution.
censpois.mle(x, tol = 1e-07)
censpois.mle(x, tol = 1e-07)
x |
A vector with positive valued data (zeros are not allowed). |
tol |
The tolerance level up to which the maximisation stops; set to 1e-07 by default. |
Instead of maximising the log-likelihood via a numerical optimiser we have used a Newton-Raphson algorithm which is faster. The lowest value in x is taken as the censored point. Values below are censored.
A list including:
iters |
The number of iterations required for the Newton-Raphson to converge. |
loglik |
The value of the maximised log-likelihood. |
lambda |
The estimated |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x1 <- rpois(10000,15) x <- x1 x[x<=10] = 10 mean(x) censpois.mle(x)$lambda
x1 <- rpois(10000,15) x <- x1 x[x<=10] = 10 mean(x) censpois.mle(x)$lambda
MLE of the Purkayastha distribution.
purka.mle(x, tol = 1e-07)
purka.mle(x, tol = 1e-07)
x |
A numerical vector with data expressed in radians or a matrix with spherical data. |
tol |
The tolerance value to terminate the Brent algorithm. |
MLE of the Purkayastha distribution is performed.
A list including:
theta |
The median direction. |
alpha |
The concentration parameter. |
loglik |
The log-likelihood. |
alpha.sd |
The standard error of the concentration parameter. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Purkayastha S. (1991). A Rotationally Symmetric Directional Distribution: Obtained through Maximum Likelihood Characterization. The Indian Journal of Statistics, Series A, 53(1): 70-83
Cabrera J. and Watson G. S. (1990). On a spherical median related distribution. Communications in Statistics-Theory and Methods, 19(6): 1973-1986.
x <- cbind( rnorm(100,1,1), rnorm(100, 2, 1) ) x <- x / sqrt(rowSums(x^2)) purka.mle(x)
x <- cbind( rnorm(100,1,1), rnorm(100, 2, 1) ) x <- x / sqrt(rowSums(x^2)) purka.mle(x)
MLE of the zero inflated Gamma and Weibull distributions.
zigamma.mle(x, tol = 1e-07) ziweibull.mle(x, tol = 1e-07)
zigamma.mle(x, tol = 1e-07) ziweibull.mle(x, tol = 1e-07)
x |
A numerical vector with positive data and zeros. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
MLE of some zero inflated models is performed.
A list including:
iters |
The iterations required by the Newton-Raphson to estimate the parameters of the distribution for the non zero data. |
loglik |
The full log-likelihood of the model. |
param |
The parameters of the model. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Sandra Taylor and Katherine Pollard (2009). Hypothesis Tests for Point-Mass Mixture Data with Application to Omics Data with Many Zero Values. Statistical Applications in Geneticsand Molecular Biology, 8(1): 1–43.
Kalimuthu Krishnamoorthy, Meesook Lee and Wang Xiao (2015). Likelihood ratio tests for comparing several gamma distributions. Environmetrics, 26(8):571-583.
x <- rgamma(200, 4, 1) x[sample(1:200, 20)] <- 0 zigamma.mle(x)
x <- rgamma(200, 4, 1) x[sample(1:200, 20)] <- 0 zigamma.mle(x)
Monte Carlo Integration with a normal distribution.
mci(fun, R = 10^6)
mci(fun, R = 10^6)
fun |
A function denoting the inside part of the expectation to be computed. |
R |
The number of draws from the normal distribution. |
The result of the integral.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Morgan B. J. (2018). Elements of simulation. Chapman & Hall/CRC.
## compute the expectation of abs(x) fun <- function(x) abs(x) mci(fun, R = 10^5) a <- function(x) abs(x) * dnorm(x) integrate(a, -Inf, Inf)
## compute the expectation of abs(x) fun <- function(x) abs(x) mci(fun, R = 10^5) a <- function(x) abs(x) * dnorm(x) integrate(a, -Inf, Inf)
Moran's I measure of spatial autocorrelation.
moranI(x, w, scaled = FALSE, R = 999)
moranI(x, w, scaled = FALSE, R = 999)
x |
A numerical vector with observations. |
w |
The inverse of a (symmetric) distance matrix. After computing the distance matrix, you invert all its elements and the elements which are zero (diagonal) and have become Inf. set them to 0. This is the w matrix the functions requires. If you want an extra step, you can row standardise this matrix by dividing each row by its total. This will makw the rowsums equal to 1. |
scaled |
If the matrix is row-standardised (all rowsums are equal to 1) then this is TRUE and FALSE otherwise. |
R |
The number of permutations to use in order to obtain the permutation based-pvalue. If R is 1 or less no permutation p-value is returned. |
Moran' I index is a measure of spatial autocorrelation. that was proposed in 1950. Instead of computing an asymptotic p-value we compute a permutation based p-value utilizing the fast method of Chatzipantsiou et al. (2019).
A vector of two values, the Moran's I index and its permutation based p-value. If R is 1 or less no permutation p-value is returned, and the second element is "NA".
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Moran, P. A. P. (1950). Notes on Continuous Stochastic Phenomena. Biometrika. 37(1): 17-23.
Chatzipantsiou C., Dimitriadis M., Papadakis M. and Tsagris M. (2019). Extremely efficient permutation and bootstrap hypothesis tests using R. Journal of Modern Applied Statistical Methods (To appear). https://arxiv.org/ftp/arxiv/papers/1806/1806.10947.pdf
x <- rnorm(50) w <- as.matrix( dist(iris[1:50, 1:4]) ) w <- 1/w diag(w) <- 0 moranI(x, w, scaled = FALSE)
x <- rnorm(50) w <- as.matrix( dist(iris[1:50, 1:4]) ) w <- 1/w diag(w) <- 0 moranI(x, w, scaled = FALSE)
Multinomial regression.
multinom.reg(y, x, tol = 1e-07, maxiters = 100)
multinom.reg(y, x, tol = 1e-07, maxiters = 100)
y |
The response variable. A numerical or a factor type vector. |
x |
A matrix or a data.frame with the predictor variables. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The maximum number of iterations Newton-Raphson will perform. |
A list including:
iters |
The number of iterations required by the Newton-Raphson. |
loglik |
The value of the maximised log-likelihood. |
be |
A matrix with the estimated regression coefficients. |
Michail Tsagris and Stefanos Fafalios.
R implementation and documentation: Michail Tsagris [email protected] and Stefanos Fafalios [email protected].
Bohning, D. (1992). Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44(1): 197-200.
y <- iris[, 5] x <- matrix( rnorm(150 * 2), ncol = 2 ) mod <- multinom.reg(y, x)
y <- iris[, 5] x <- matrix( rnorm(150 * 2), ncol = 2 ) mod <- multinom.reg(y, x)
Naive Bayes classifier for binary (Bernoulli) data.
bernoulli.nb(xnew = NULL, x, ina)
bernoulli.nb(xnew = NULL, x, ina)
xnew |
A numerical matrix with new predictor variables whose group is to be predicted. Each column contains binary (0 or 1) data. |
x |
A numerical matrix with observed predictor variables. Each column contains binary (0 or 1) data. |
ina |
A numerical vector with strictly positive numbers, i.e. 1,2,3 indicating the groups of the dataset. Alternatively this can be a factor variable. |
Each column is supposed to contain binary data. Thus, for each column a Berboulli distributions is fitted. The product of the densities is the joint multivariate distribution.
A list including:
pi |
A matrix with the estimated probabilities of each group and variable. |
ni |
The sample size of each group in the dataset. |
est |
The estimated group of the xnew observations. It returns a numerical value back regardless of the target variable being numerical as well or factor. Hence, it is suggested that you do \"as.numeric(ina)\" in order to see what is the predicted class of the new data. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- matrix( rbinom(50 * 4, 1, 0.5), ncol = 4 ) ina <- rbinom(50, 1, 0.5) + 1 a <- bernoulli.nb(x, x, ina)
x <- matrix( rbinom(50 * 4, 1, 0.5), ncol = 4 ) ina <- rbinom(50, 1, 0.5) + 1 a <- bernoulli.nb(x, x, ina)
Naive Bayes classifiers.
weibull.nb(xnew = NULL, x, ina, tol = 1e-07) normlog.nb(xnew = NULL, x, ina) laplace.nb(xnew = NULL, x, ina) logitnorm.nb(xnew = NULL, x, ina) beta.nb(xnew = NULL, x, ina) cauchy.nb(xnew = NULL, x, ina)
weibull.nb(xnew = NULL, x, ina, tol = 1e-07) normlog.nb(xnew = NULL, x, ina) laplace.nb(xnew = NULL, x, ina) logitnorm.nb(xnew = NULL, x, ina) beta.nb(xnew = NULL, x, ina) cauchy.nb(xnew = NULL, x, ina)
xnew |
A numerical matrix with new predictor variables whose group is to be predicted. This is set to NUUL, as you might want just the model and not to predict the membership of new observations. For the normlog this contains positive numbers (greater than or equal to zero), but for the multinomial and Poisson cases, the matrix must contain integer valued numbers only. For the logistic normal (logitnorm.nb) and beta (beta.nb) the data must be strictly between 0 and 1. |
x |
A numerical matrix with the observed predictor variable values. For the Gaussian case (normlognb.nb) this contains positive numbers (greater than or equal to zero), but for the multinomial and Poisson cases, the matrix must contain integer valued numbers only. For the logistic normal (logitnorm.nb) and beta (beta.nb) the data must be strictly between 0 and 1. |
ina |
A numerical vector with strictly positive numbers, i.e. 1,2,3 indicating the groups of the dataset. Alternatively this can be a factor variable. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm in the Weibull distribution. |
Depending on the classifier a list including (the ni and est are common for all classifiers):
shape |
A matrix with the shape parameters. |
scale |
A matrix with the scale parameters. |
expmu |
A matrix with the mean parameters. |
sigma |
A matrix with the (MLE, hence biased) variance parameters. |
location |
A matrix with the location parameters (medians). |
scale |
A matrix with the scale parameters. |
mean |
A matrix with the scale parameters. |
var |
A matrix with the variance parameters. |
a |
A matrix with the "alpha" parameters. |
b |
A matrix with the "beta" parameters. |
ni |
The sample size of each group in the dataset. |
est |
The estimated group of the xnew observations. It returns a numerical value back regardless of the target variable being numerical as well or factor. Hence, it is suggested that you do \"as.numeric(ina)\" in order to see what is the predicted class of the new data. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
weibullnb.pred, vm.nb, nb.cv, mle.lda, big.knn
x <- matrix( rweibull( 100, 3, 4 ), ncol = 2 ) ina <- rbinom(50, 1, 0.5) + 1 a <- weibull.nb(x, x, ina)
x <- matrix( rweibull( 100, 3, 4 ), ncol = 2 ) ina <- rbinom(50, 1, 0.5) + 1 a <- weibull.nb(x, x, ina)
Naive Bayes classifiers for directional data.
vm.nb(xnew = NULL, x, ina, tol = 1e-07) spml.nb(xnew = NULL, x, ina, tol = 1e-07)
vm.nb(xnew = NULL, x, ina, tol = 1e-07) spml.nb(xnew = NULL, x, ina, tol = 1e-07)
xnew |
A numerical matrix with new predictor variables whose group is to be predicted. Each column refers to an angular variable. |
x |
A numerical matrix with observed predictor variables. Each column refers to an angular variable. |
ina |
A numerical vector with strictly positive numbers, i.e. 1,2,3 indicating the groups of the dataset. Alternatively this can be a factor variable. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
Each column is supposed to contain angular measurements. Thus, for each column a von Mises distribution or an circular angular Gaussian distribution is fitted. The product of the densities is the joint multivariate distribution.
A list including:
mu |
A matrix with the mean vectors expressed in radians. |
mu1 |
A matrix with the first set of mean vectors. |
mu2 |
A matrix with the second set of mean vectors. |
kappa |
A matrix with the kappa parameters for the vonMises distribution or with the norm of the mean vectors for the circular angular Gaussian distribution. |
ni |
The sample size of each group in the dataset. |
est |
The estimated group of the xnew observations. It returns a numerical value back regardless of the target variable being numerical as well or factor. Hence, it is suggested that you do \"as.numeric(ina)\" in order to see what is the predicted class of the new data. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- matrix( runif( 100, 0, 1 ), ncol = 2 ) ina <- rbinom(50, 1, 0.5) + 1 a <- vm.nb(x, x, ina)
x <- matrix( runif( 100, 0, 1 ), ncol = 2 ) ina <- rbinom(50, 1, 0.5) + 1 a <- vm.nb(x, x, ina)
Negative binomial regression.
negbin.reg(y, x, tol = 1e-07, maxiters = 100)
negbin.reg(y, x, tol = 1e-07, maxiters = 100)
y |
The dependent variable, a numerical vector with integer valued numbers. |
x |
A matrix or a data.frame with the indendent variables. |
tol |
The tolerance value required by the Newton-Raphson to stop. |
maxiters |
The maximum iterations allowed. |
A negative binomial regression model is fitted. The standard errors of the regressions are not returned as we do not compute the full Hessian matrix at each step of the Newton-Raphson.
A list including:
be |
The regression coefficients. |
loglik |
The loglikelihood of the regression model. |
iters |
The iterations required by the Newton-Raphson. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected] and Stefanos Fafalios [email protected].
y <- rnbinom(100, 10, 0.7) x <- matrix( rnorm(100 * 3), ncol = 3 ) mod <- negbin.reg(y, x)
y <- rnbinom(100, 10, 0.7) x <- matrix( rnorm(100 * 3), ncol = 3 ) mod <- negbin.reg(y, x)
Non linear least squares regression for percentages or proportions.
propols.reg(y, x, cov = FALSE, tol = 1e-07 ,maxiters = 100)
propols.reg(y, x, cov = FALSE, tol = 1e-07 ,maxiters = 100)
y |
The dependent variable, a numerical vector with percentages or proporions, including 0s and or 1s. |
x |
A matrix with the indendent variables. |
cov |
Should the sandwich covariance matrix and the standard errors be returned? If yes, set this equal to TRUE. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm.
This is set to |
maxiters |
The maximum number of iterations that can take place during the fitting. |
The ordinary least squares between the observed and the fitted percentages is adopted as the objective function. This involves numerical optimization since the relationship is non-linear. There is no log-likelihood. This is the univariate version of the OLS regression for compositional data mentioned in Murteira and Ramalho (2016).
A list including:
sse |
The sum of squares of the raw residuals. |
be |
The beta coefficients. |
seb |
The sandwich standard errors of the beta coefficients, if the input argument argument was set to TRUE. |
covb |
The sandwich covariance matrix of the beta coefficients, if the input argument argument was set to TRUE. |
iters |
The number of iterations required by the Newton-Raphson algorithm. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Murteira, Jose MR, and Joaquim JS Ramalho 2016. Regression analysis of multivariate fractional data. Econometric Reviews 35(4): 515-552.
prophelling.reg, simplex.mle, kumar.mle
y <- rbeta(150, 3, 4) x <- iris a <- propols.reg(y, x)
y <- rbeta(150, 3, 4) x <- iris a <- propols.reg(y, x)
One sample bootstrap t-test for a vector.
boot.ttest1(x, m, R = 999)
boot.ttest1(x, m, R = 999)
x |
A numerical vector with the data. |
m |
The assumed mean value. |
R |
The number of bootstrap resamples to draw. |
The usual one sample bootstrap t-test is implemented, only faster.
res |
A two valued vector with the test statistic and its p-value. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected]
boot.student2, perm.ttest2, welch.tests, jack.mean
x <- rexp(30) a <- t.test(x, mu = 0) b <- boot.ttest1(x, 0)
x <- rexp(30) a <- t.test(x, mu = 0) b <- boot.ttest1(x, 0)
Orthogonal matching variable selection.
omp2(y, x, xstand = TRUE, tol = qchisq(0.95, 1), type = "gamma" )
omp2(y, x, xstand = TRUE, tol = qchisq(0.95, 1), type = "gamma" )
y |
The response variable, a numeric vector. For "omp" this can be either a vector with discrete (count) data, 0 and 1, non negative values, strictly positive or a factor (categorical) variable. |
x |
A matrix with the data, where the rows denote the observations and the columns are the variables. |
xstand |
If this is TRUE the independent variables are standardised. |
tol |
The tolerance value to terminate the algorithm. This is the change in the criterion value
between two successive steps. For "ompr" the default value is 2 because the default method
is "BIC". The default value is the 95% quantile of the |
type |
This denotes the parametric model to be used each time. It depends upon the nature of y. The possible values are "gamma", "negbin", or "multinomial". |
This is the continuation of the "omp" function of the Rfast. We added some more regression models. The "gamma" and the "multinomial" models have now been implemented in C++.
A list including:
runtime |
The runtime of the algorithm. |
info |
A matrix with two columns. The selected variable(s) and the criterion value at every step. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Pati Y. C., Rezaiifar R. and Krishnaprasad P. S. (1993). Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Signals, Systems and Computers. 1993 Conference Record of The Twenty-Seventh Asilomar Conference on. IEEE.
Mazin Abdulrasool Hameed (2012). Comparative analysis of orthogonal matching pursuit and least angle regression. MSc thesis, Michigan State University. https://www.google.gr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0ahUKEwik9P3Yto7XAhUiCZoKHQ8XDr8QFgglMAA&url=https
Lozano A., Swirszcz G. and Abe N. (2011). Group orthogonal matching pursuit for logistic regression. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
The -OMP algorithm for feature selection with application to gene expression data.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(2): 1214-1224.
https://arxiv.org/pdf/2004.00281.pdf
x <- matrix( rnorm(100 * 50), ncol = 50 ) y <- rgamma(100, 4, 1) a <- omp2(y, x) a x <- NULL
x <- matrix( rnorm(100 * 50), ncol = 50 ) y <- rgamma(100, 4, 1) a <- omp2(y, x) a x <- NULL
Parametric and non-parametric bootstrap for linear regression model.
lm.parboot(x, y, R = 1000) lm.nonparboot(x, y, R = 1000)
lm.parboot(x, y, R = 1000) lm.nonparboot(x, y, R = 1000)
x |
The predictor variables, a vector or a matrix or a data frame. |
y |
The response variable, a numerical vector with data. |
R |
The number of parametric bootstrap replications to perform. |
An efficient implementation of parametric or non-parametric bootstrapping the residuals for linear models is provided.
A matrix with R columns and rows equal to the number of the regression parameters. Each column contains the set of a bootstrap beta regression coefficients.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Efron Bradley and Robert J. Tibshirani (1993). An introduction to the bootstrap. New York: Chapman & Hall/CRC.
lm.drop1, leverage, pc.sel, mmpc
y <- rnorm(50) x <- matrix( rnorm( 50 * 2), ncol = 2 ) a <- lm.parboot(x, y, 500)
y <- rnorm(50) x <- matrix( rnorm( 50 * 2), ncol = 2 ) a <- lm.parboot(x, y, 500)
Permutation t-test for 2 independent samples.
perm.ttest2(x, y, B = 999)
perm.ttest2(x, y, B = 999)
x |
A numerical vector with the data. |
y |
A numerical vector with the data. |
B |
The number of permutations to perform. |
The usual permutation based p-value is computed.
A vector with the test statistic and the permutation based p-value.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Good P. I. (2005). Permutation, parametric and bootstrap tests of hypotheses: a practical guide to resampling methods for testing hypotheses. Springer 3rd Edition.
x <- rexp(30, 4) y <- rbeta(30, 2.5, 7.5) perm.ttest2(x, y, 999)
x <- rexp(30, 4) y <- rbeta(30, 2.5, 7.5) perm.ttest2(x, y, 999)
Prediction with naive Bayes classifier for binary (Bernoulli) data.
bernoullinb.pred(xnew, pi, ni)
bernoullinb.pred(xnew, pi, ni)
xnew |
A numerical matrix with new predictor variables whose group is to be predicted. Each column refers to an angular variable. |
pi |
A matrix with the estimated probabilities of each group. |
ni |
The sample size of each group in the dataset. |
Each column is supposed to contain binary data. Thus, for each column a Berboulli distributions is fitted. The product of the densities is the joint multivariate distribution.
A numerical vector with 1, 2, ... denoting the predicted group.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- matrix( rbinom(50 * 4, 1, 0.5), ncol = 4 ) ina <- rbinom(50, 1, 0.5) + 1 a <- bernoulli.nb(x, x, ina)
x <- matrix( rbinom(50 * 4, 1, 0.5), ncol = 4 ) ina <- rbinom(50, 1, 0.5) + 1 a <- bernoulli.nb(x, x, ina)
Prediction with some naive Bayes classifiers.
weibullnb.pred(xnew, shape, scale, ni) normlognb.pred(xnew, expmu, sigma, ni) laplacenb.pred(xnew, location, scale, ni) logitnormnb.pred(xnew, m, s, ni) betanb.pred(xnew, a, b, ni) cauchynb.pred(xnew, location, scale, ni)
weibullnb.pred(xnew, shape, scale, ni) normlognb.pred(xnew, expmu, sigma, ni) laplacenb.pred(xnew, location, scale, ni) logitnormnb.pred(xnew, m, s, ni) betanb.pred(xnew, a, b, ni) cauchynb.pred(xnew, location, scale, ni)
xnew |
A numerical matrix with new predictor variables whose group is to be predicted. For the Gaussian naive Bayes, this is set to NUUL, as you might want just the model and not to predict the membership of new observations. For the Gaussian case this contains positive numbers (greater than or equal to zero), but for the multinomial and Poisson cases, the matrix must contain integer valued numbers only. For the logistic normal (logitnormnb.pred) the data must be percentages strictly between 0 and 1. |
shape |
A matrix with the group shape parameters. Each row corresponds to a group. |
scale |
A matrix with the group scale parameters of the Laplace or the Cauchy distribution. Each row corresponds to a group. |
expmu |
A matrix with the group mean parameters. Each row corresponds to a group. |
m |
A matrix with the group mean parameters. Each row corresponds to a group. |
sigma |
A matrix with the group (MLE, hence biased) variance parameters. Each row corresponds to a group. |
s |
A matrix with the group MLE variance parameters. Each row corresponds to a group. |
location |
A matrix with the group location parameters of the Laplace or of the Cauchy distribution. Each row corresponds to a group. |
a |
A matrix with the group "alpha" parameters of the beta distribution. Each row corresponds to a group. |
b |
A matrix with the group "beta" parameters of the beta distribution. Each row corresponds to a group. |
ni |
A vector with the frequencies of each group. |
A numerical vector with 1, 2, ... denoting the predicted group.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- matrix( rweibull( 100, 3, 4 ), ncol = 2 ) ina <- rbinom(50, 1, 0.5) + 1 a <- weibull.nb(x, x, ina) est <- weibullnb.pred(x, a$shape, a$scale, a$ni) table(ina, est)
x <- matrix( rweibull( 100, 3, 4 ), ncol = 2 ) ina <- rbinom(50, 1, 0.5) + 1 a <- weibull.nb(x, x, ina) est <- weibullnb.pred(x, a$shape, a$scale, a$ni) table(ina, est)
Prediction with some naive Bayes classifiers for circular data.
vmnb.pred(xnew, mu, kappa, ni) spmlnb.pred(xnew, mu1, mu2, ni)
vmnb.pred(xnew, mu, kappa, ni) spmlnb.pred(xnew, mu1, mu2, ni)
xnew |
A numerical matrix with new predictor variables whose group is to be predicted. Each column refers to an angular variable. |
mu |
A matrix with the mean vectors expressed in radians. |
mu1 |
A matrix with the first set of mean vectors. |
mu2 |
A matrix with the second set of mean vectors. |
kappa |
A matrix with the kappa parameters for the vonMises distribution or with the norm of the mean vectors for the circular angular Gaussian distribution. |
ni |
The sample size of each group in the dataset. |
Each column is supposed to contain angular measurements. Thus, for each column a von Mises distribution or an circular angular Gaussian distribution is fitted. The product of the densities is the joint multivariate distribution.
A numerical vector with 1, 2, ... denoting the predicted group.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- matrix( runif( 100, 0, 1 ), ncol = 2 ) ina <- rbinom(50, 1, 0.5) + 1 a <- vm.nb(x, x, ina) a2 <- vmnb.pred(x, a$mu, a$kappa, a$ni)
x <- matrix( runif( 100, 0, 1 ), ncol = 2 ) ina <- rbinom(50, 1, 0.5) + 1 a <- vm.nb(x, x, ina) a2 <- vmnb.pred(x, a$mu, a$kappa, a$ni)
Principal component analysis.
pca(x, center = TRUE, scale = TRUE, k = NULL, vectors = FALSE)
pca(x, center = TRUE, scale = TRUE, k = NULL, vectors = FALSE)
x |
A numerical |
center |
Do you want your data centered? TRUE or FALSE. |
scale |
Do you want each of your variables scaled, i.e. to have unit variance? TRUE or FALSE. |
k |
If you want a specific number of eigenvalues and eigenvectors set it here, otherwise all eigenvalues (and eigenvectors if requested) will be returned. |
vectors |
Do you want the eigenvectors be returned? By dafault this is FALSE. |
The function is a faster version of R's prcomp.
A list including:
values |
The eigenvalues. |
vectors |
The eigenvectors. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- matrix( rnorm(1000 * 20 ), ncol = 20) a <- pca(x) x <- NULL
x <- matrix( rnorm(1000 * 20 ), ncol = 20) a <- pca(x) x <- NULL
Principal components regression.
pcr(y, x, k = 1, xnew = NULL)
pcr(y, x, k = 1, xnew = NULL)
y |
A real values vector. |
x |
A matrix with the predictor variable(s), they have to be continuous. |
k |
The number of principal components to use. This can be a single number or a vector starting from 1. In the second case you get results for the sequence of principal components. |
xnew |
If you have new data use it, otherwise leave it NULL. |
The principal components of the cross product of the independent variables are obtained and classical regression is performed.
A list including:
be |
The beta coefficients of the predictor variables computed via the principcal components. |
per |
The percentage of variance of the cross product of the independent variables explained by the k components. |
vec |
The principal components, the loadings. |
est |
The fitted or the predicted values (if xnew is not NULL). |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Jolliffe I.T. (2002). Principal Component Analysis.
y <- as.vector(iris[, 1]) x <- as.matrix(iris[, 2:4]) mod1 <- pcr(y, x, 1) mod2 <- pcr(y, x, 2) mod <- pcr(y, x, k = 1:3) ## all results at once
y <- as.vector(iris[, 1]) x <- as.matrix(iris[, 2:4]) mod1 <- pcr(y, x, 1) mod2 <- pcr(y, x, 2) mod <- pcr(y, x, k = 1:3) ## all results at once
Random effects and weighted least squares meta analysis.
refmeta(yi, vi, tol = 1e-07) wlsmeta(yi, vi)
refmeta(yi, vi, tol = 1e-07) wlsmeta(yi, vi)
yi |
The observations. |
vi |
The variances of the observations. |
tol |
The toleranve value to terminate Brent's algorithm. |
The refmeta command performs random effects estimation, via restricted maximum likelihood estimation (REML), of the common mean. The wlsmeta command implements weighted least squares (WLS) meta analysis. See references for this.
A vector with many elements. The fixed effects mean estimate, the
estimate, the
, the
, the Q test statistic and it's p-value,
the
estimate and the random effects mean estimate.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Annamaria Guolo and Cristiano Varin (2017). Random-effects meta-analysis: The number of studies matters. Statistical Methods in Medical Research, 26(3): 1500-1518.
Stanley T. D. and Doucouliagos H. (2015). Neither fixed nor random: weighted least squares meta-analysis. Statistics in Medicine, 34(13): 2116-2127.
y <- rnorm(30) vi <- rexp(30, 3) refmeta(y, vi) wlsmeta(y, vi)
y <- rnorm(30) vi <- rexp(30, 3) refmeta(y, vi) wlsmeta(y, vi)
Random integer values simulation.
Sample.int(n, size = n, replace = FALSE) Sample(x, size, replace = FALSE)
Sample.int(n, size = n, replace = FALSE) Sample(x, size, replace = FALSE)
x |
A numeric vector for sampling. |
n |
This must be an integer value. The function will then draw random integer values from 1:n. |
size |
The number of integer values to sample. |
replace |
Do you want to sample with replacement? If yes, set this equal to TRUE. |
The function does the same job, up to some level, with R's built-in fuction sample.int
.
A vector with integer values.
Manos Papadakis.
R implementation: Manos Papadakis <[email protected]>. R documentation: Michail Tsagris <[email protected]>.
x <- Sample.int(10, 1000, replace = TRUE) Sample(x,length(x))
x <- Sample.int(10, 1000, replace = TRUE) Sample(x,length(x))
Random values generation from a Be(a, 1) distribution.
rbeta1(n, a)
rbeta1(n, a)
n |
The sample size, a numerical value. |
a |
The shape parameter of the beta distribution. |
The function genrates random values from a Be(a, 1) distribution.
A vector with the simulated data.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
kumar.mle, simplex.mle, collogitnorm.mle, propols.reg
x <- rbeta1(100, 3)
x <- rbeta1(100, 3)
Random values simulation from the uniform distribution.
Runif(n, min = 0, max = 1)
Runif(n, min = 0, max = 1)
n |
The number of values to generate. |
min |
The lower value of the uniform distribution. |
max |
The upper value of the uniform distribution. |
This function does the same job as R's built-in function runif
.
A vector with simulated values.
Manos Papadakis.
R implementation and documentation: Manos Papadakis <[email protected]>.
x <- Runif(1000, 0, 1)
x <- Runif(1000, 0, 1)
Regularised maximum likelihood linear discriminant analysis.
reg.mle.lda(xnew, x, ina, lambda)
reg.mle.lda(xnew, x, ina, lambda)
xnew |
A numerical vector or a matrix with the new observations, continuous data. |
x |
A matrix with numerical data. |
ina |
A numerical vector or factor with consecutive numbers indicating the group to which each observation belongs to. |
lambda |
A vector of regularization values |
Regularised maximum likelihood linear discriminant analysis is performed. The function is not extremely fast, yet is pretty fast.
A matrix with the predicted group of each observation in "xnew".
Every column corresponds to a value. If you have just on value of
, then
you will have one column only.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
regmlelda.cv mle.lda, fisher.da, big.knn, weibull.nb
x <- as.matrix(iris[, 1:4]) ina <- iris[, 5] a <- reg.mle.lda(x, x, ina, lambda = seq(0, 1, by = 0.1) )
x <- as.matrix(iris[, 1:4]) ina <- iris[, 5] a <- reg.mle.lda(x, x, ina, lambda = seq(0, 1, by = 0.1) )
test
Repeated measures ANOVA (univariate data) using Hotelling's test.
rm.hotel(x, a = 0.05)
rm.hotel(x, a = 0.05)
x |
A numerical matrix with the repeated measurements. Each column contains the values of the repeated measurements. |
a |
The level of significance, default value is equal to 0.05. |
This is a multivariate test for the equality of means of repeated measurements.
A list including:
m |
The mean vector. |
result |
A vector with the test statistic value, it's associated p-value, the numerator and denominator degrees of freedom and the critical value. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- as.matrix(iris[, 1:4]) ## assume they are repeated measurements rm.hotel(x)
x <- as.matrix(iris[, 1:4]) ## assume they are repeated measurements rm.hotel(x)
Sample quantiles and col/row wise quantiles.
colQuantile(x,probs,parallel=FALSE,cores=0) ## S3 method for class 'matrix' colQuantile(x,probs,parallel=FALSE,cores=0) ## S3 method for class 'data.frame' colQuantile(x,probs,parallel=FALSE,cores=0) rowQuantile(x,probs,parallel=FALSE,cores=0) Quantile(x,probs,parallel=FALSE)
colQuantile(x,probs,parallel=FALSE,cores=0) ## S3 method for class 'matrix' colQuantile(x,probs,parallel=FALSE,cores=0) ## S3 method for class 'data.frame' colQuantile(x,probs,parallel=FALSE,cores=0) rowQuantile(x,probs,parallel=FALSE,cores=0) Quantile(x,probs,parallel=FALSE)
x |
Numeric vector whose sample quantiles are wanted. NA and NaN values are not allowed in numeric vectors. For the col/row versions a numerical matrix or data.frame. |
probs |
Numeric vector of probabilities with values in [0,1], not missing values. Values up to 2e-14 outside that range are accepted and automatically moved to the nearby endpoint by C++. |
parallel |
Do you want to do it in parallel, for column - row major, in C++? TRUE or FALSE. |
cores |
Number of cores to use for parallelism. Valid only when argument parallel is set to TRUE. Default value is 0 and it means the maximum supported cores. |
This is the same function as R's built in "quantile" with its default option, type = 7. We have also implemented it in a col/row-wise fashion.
The function will return a vector of the same mode as the arguments given. NAs will be removed.
Manos Papadakis.
R implementation and documentation: Manos Papadakis [email protected].
x<-rnorm(1000) probs<-runif(10) sum( quantile(x, probs = probs) - Quantile(x, probs) )
x<-rnorm(1000) probs<-runif(10) sum( quantile(x, probs = probs) - Quantile(x, probs) )
Scaled logistic regression.
sclr(y, x, full = FALSE, tol = 1e-07, maxiters = 100)
sclr(y, x, full = FALSE, tol = 1e-07, maxiters = 100)
y |
The dependent variable; a numerical vector with two values (0 and 1). |
x |
A matrix with the data, where the rows denote the samples (and the two groups) and the columns are the variables. This can be a matrix or a data.frame (with factors). |
full |
If this is FALSE, the coefficients and the log-likelihood will be returned only. If this is TRUE, more information is returned. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The max number of iterations that can take place in each regression. |
When full is FALSE a list including:
theta |
The estimated |
be |
The estimated regression coefficients. |
loglik |
The log-likelihood of the model. |
iters |
The number of iterations required by Newton-Raphson. |
When full is TRUE a list including:
info |
The estimated |
loglik |
The log-likelihood. |
iters |
The number of iterations required by Newton-Raphson. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Dunning AJ (2006). A model for immunological correlates of protection. Statistics in Medicine, 25(9): 1485-1497. https://doi.org/10.1002/sim.2282.
x <- matrix(rnorm(100 * 2), ncol = 2) y <- rbinom(100, 1, 0.6) ## binary logistic regression a <- sclr(y, x)
x <- matrix(rnorm(100 * 2), ncol = 2) y <- rbinom(100, 1, 0.6) ## binary logistic regression a <- sclr(y, x)
Score test for overdispersion in Poisson regression.
overdispreg.test(y, x)
overdispreg.test(y, x)
y |
A vector with count data. |
x |
A numerical matrix with predictor variables. |
A score test for overdispersion in Poisson regression is implemented.
A vector with two values. The test statistic and its associated p-value.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Yang Z., Hardin J.W. and Addy C.L. (2009). A score test for overdierpsdion in Poisson regression based on the generalised Poisson-2 model. Journal of Statistical Planning and Inference, 139(4): 1514–1521.
ztp.reg, censpois.mle wald.poisrat
y <- rnbinom(100, 10, 0.4) x <- rnorm(100) overdispreg.test(y, x)
y <- rnbinom(100, 10, 0.4) x <- rnorm(100) overdispreg.test(y, x)
Single terms deletion hypothesis testing in a linear regression model.
lm.drop1(y, x, type = "F")
lm.drop1(y, x, type = "F")
y |
The dependent variable, a numerical vector with numbers. |
x |
A numerical matrix with the indendent variables. We add, internally, the first column of ones. |
type |
If you want to perform the usual F (or t) test set this equal to "F". For the test based on the partial correlation set this equal to "cor". |
This is the same function as R's built in drop1 that it works with the F test or the partial correlation coefficient. For the linear regression model, the Wald test is equivalent to the partial F test. So, instead of performing many regression models with single term deletions we perform one regression model with all variables and compute their Wald test effectively. Note, that this is true, only if the design matrix "x" contains the vectors of ones and in our case this must be, strictly, the first column. The second option is to compute the p-value of the partial correlation.
A matrix with two columns. The test statistic and its associated pvalue for each independent variable.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Hastie T., Tibshirani R. and Friedman J. (2008). The Elements of Statistical Learning (2nd Ed.), Springer.
y <- rnorm(150) x <- as.matrix(iris[, 1:4]) a <- lm(y~., data.frame(x) ) drop1(a, test = "F") lm.drop1(y, x )
y <- rnorm(150) x <- as.matrix(iris[, 1:4]) a <- lm(y~., data.frame(x) ) drop1(a, test = "F") lm.drop1(y, x )
The skeleton of a Bayesian network produced by the FEDHC algorithm.
fedhc.skel(x, method = "pearson", alpha = 0.05, ini.stat = NULL, R = NULL, parallel = FALSE)
fedhc.skel(x, method = "pearson", alpha = 0.05, ini.stat = NULL, R = NULL, parallel = FALSE)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix
using |
method |
If you have continuous data, this "pearson". If you have categorical data though, this must be "cat". In this case, make sure the minimum value of each variable is zero. The function "g2Test" in the R package Rfast and the relevant functions work that way. |
alpha |
The significance level (suitable values in (0, 1)) for assessing the p-values. Default value is 0.05. |
ini.stat |
If the initial test statistics (univariate associations) are available, pass them through this parameter. |
R |
If the correlation matrix is available, pass it here. |
parallel |
Set this to TRUE for parallel computations. |
Similar to MMHC and PCHC the first phase consists of a variable selection procedure, the FBED algortihm (Borboudakis and Tsamardinos, 2019).
A list including:
ini.stat |
The test statistics of the univariate associations. |
ini.pvalue |
The initial p-values univariate associations. |
pvalue |
A matrix with the logarithm of the p-values of the updated associations. This final p-value is the maximum p-value among the two p-values in the end. |
runtime |
The duration of the algorithm. |
ntests |
The number of tests conducted during each k. |
G |
The adjancency matrix. A value of 1 in G[i, j] appears in G[j, i] also, indicating that i and j have an edge between them. |
Michail Tsagris and Stefanos Fafalios.
R implementation and documentation: Michail Tsagris [email protected] and Stefanos Fafalios [email protected].
Tsagris M. (2020). The FEDHC Bayesian network learning algorithm. https://arxiv.org/pdf/2012.00113.pdf.
Borboudakis G. and Tsamardinos I. (2019). Forward-backward selection with early dropping. Journal of Machine Learning Research, 20(8): 1-39.
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning 65(1): 31-78.
# simulate a dataset with continuous data x <- matrix( rnorm(200 * 50, 1, 10), nrow = 200 ) a <- fedhc.skel(x)
# simulate a dataset with continuous data x <- matrix( rnorm(200 * 50, 1, 10), nrow = 200 ) a <- fedhc.skel(x)
The skeleton of a Bayesian network learned with the MMHC algorithm.
mmhc.skel(x, method = "pearson", max_k = 3, alpha = 0.05, ini.stat = NULL, R = NULL, parallel = FALSE)
mmhc.skel(x, method = "pearson", max_k = 3, alpha = 0.05, ini.stat = NULL, R = NULL, parallel = FALSE)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix
using |
method |
If you have continuous data, this "pearson". If you have categorical data though, this must be "cat". In this case, make sure the minimum value of each variable is zero. The function "g2Test" in the R package Rfast and the relevant functions work that way. |
max_k |
The maximum conditioning set to use in the conditional indepedence test (see Details). Integer, default value is 3. |
alpha |
The significance level (suitable values in (0, 1)) for assessing the p-values. Default value is 0.05. |
ini.stat |
If the initial test statistics (univariate associations) are available, pass them through this parameter. |
R |
If the correlation matrix is available, pass it here. |
parallel |
Set this to TRUE for parallel computations. |
The max_k option: the maximum size of the conditioning set to use in the conditioning independence test.
Larger values provide more accurate results, at the cost of higher computational times. When the sample size is
small (e.g., observations) the max_k parameter should be 3 for example, otherwise the conditional
independence test may not be able to provide reliable results.
A list including:
ini.stat |
The test statistics of the univariate associations. |
ini.pvalue |
The initial p-values univariate associations. |
pvalue |
A matrix with the logarithm of the p-values of the updated associations. This final p-value is the maximum p-value among the two p-values in the end. |
runtime |
The duration of the algorithm. |
ntests |
The number of tests conducted during each k. |
G |
The adjancency matrix. A value of 1 in G[i, j] appears in G[j, i] also, indicating that i and j have an edge between them. |
Michail Tsagris and Stefanos Fafalios.
R implementation and documentation: Michail Tsagris [email protected] and Stefanos Fafalios [email protected].
Tsamardinos, I., Aliferis, C. F. and Statnikov, A. (2003). Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 673-678). ACM.
Brown, L. E., Tsamardinos, I. and Aliferis, C. F. (2004). A novel algorithm for scalable and accurate Bayesian network learning. Medinfo, 711-715.
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning 65(1):31-78.
# simulate a dataset with continuous data x <- matrix( rnorm(300 * 30, 1, 100), nrow = 300 ) a <- mmhc.skel(x)
# simulate a dataset with continuous data x <- matrix( rnorm(300 * 30, 1, 100), nrow = 300 ) a <- mmhc.skel(x)
The k-NN algorithm for really lage scale data.
big.knn(xnew, y, x, k = 2:100, type = "C")
big.knn(xnew, y, x, k = 2:100, type = "C")
xnew |
A matrix with new data, new predictor variables whose response variable must be predicted. |
y |
A vector of data. The response variable, which can be either continuous or categorical (factor is acceptable). |
x |
A matrix with the available data, the predictor variables. |
k |
A vector with the possible numbers of nearest neighbours to be considered. |
type |
If your response variable y is numerical data, then this should be "R" (regression). If y is in general categorical set this argument to "C" (classification). |
The concept behind k-NN is simple. Suppose we have a matrix with predictor variables and a vector with the response variable (numerical or categorical). When a new vector with observations (predictor variables) is available, its corresponding response value, numerical or categorical, is to be predicted. Instead of using a model, parametric or not, one can use this ad hoc algorithm.
The k smallest distances between the new predictor variables and the existing ones are calculated. In the case of regression, the average, median, or harmonic mean of the corresponding response values of these closest predictor values are calculated. In the case of classification, i.e. categorical response value, a voting rule is applied. The most frequent group (response value) is where the new observation is to be allocated.
This function allows for the Euclidean distance only.
A matrix whose number of columns is equal to the size of k. If in the input you provided there is just one value of k, then a matrix with one column is returned containing the predicted values. If more than one value was supplied, the matrix will contain the predicted values for every value of k.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Friedman J., Hastie T. and Tibshirani R. (2017). The elements of statistical learning. New York: Springer.
Cover TM and Hart PE (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory. 13(1):21-27.
bigknn.cv, reg.mle.lda, multinom.reg
x <- as.matrix(iris[, 1:4]) mod <- big.knn(xnew = x, y = iris[, 5], x = x, k = c(6, 7) )
x <- as.matrix(iris[, 1:4]) mod <- big.knn(xnew = x, y = iris[, 5], x = x, k = c(6, 7) )
Tobit regression.
tobit.reg(y, x, ylow = 0, full = FALSE, tol = 1e-07, maxiters = 100)
tobit.reg(y, x, ylow = 0, full = FALSE, tol = 1e-07, maxiters = 100)
y |
The dependent variable; a numerical vector with values. |
x |
A matrix with the data, where the rows denote the samples (and the two groups) and the columns are the variables. This can be a matrix or a data.frame (with factors). |
ylow |
The lowest value below which nothing is observed. The cut-off value. |
full |
If this is FALSE, the coefficients and the log-likelihood will be returned only. If this is TRUE, more information is returned. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The max number of iterations that can take place in each regression. |
The tobit regression model is fitted.
When full is FALSE a list including:
be |
The estimated regression coefficients. |
s |
The estimated scale parameter. |
loglik |
The log-likelihood of the model. |
iters |
The number of iterations required by Newton-Raphson. |
When full is TRUE a list including:
info |
The estimated |
loglik |
The log-likelihood. |
iters |
The number of iterations required by Newton-Raphson. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tobin James (1958). Estimation of Relationships for Limited Dependent Variables. Econometrica, 26(1): 24–36.
https://en.wikipedia.org/wiki/Tobit_model
hp.reg, ztp.reg, censweibull.mle, censpois.mle
x <- rnorm(100) y <- rnorm(100) y[y < 0] <- 0 a <- tobit.reg(y, x, full = TRUE)
x <- rnorm(100) y <- rnorm(100) y[y < 0] <- 0 a <- tobit.reg(y, x, full = TRUE)
Trimmed mean.
trim.mean(x, a = 0.05,parallel=FALSE) colTrimMean(x, a = 0.05,parallel=FALSE,cores=0) ## S3 method for class 'matrix' colTrimMean(x,a = 0.05,parallel=FALSE,cores=0) ## S3 method for class 'data.frame' colTrimMean(x,a = 0.05,parallel=FALSE,cores=0) rowTrimMean(x, a = 0.05,parallel=FALSE,cores=0)
trim.mean(x, a = 0.05,parallel=FALSE) colTrimMean(x, a = 0.05,parallel=FALSE,cores=0) ## S3 method for class 'matrix' colTrimMean(x,a = 0.05,parallel=FALSE,cores=0) ## S3 method for class 'data.frame' colTrimMean(x,a = 0.05,parallel=FALSE,cores=0) rowTrimMean(x, a = 0.05,parallel=FALSE,cores=0)
x |
A numerical vector or a numerical matrix or data.frame. |
a |
A number in (0, 1), the proportion of data to trim. |
parallel |
Run the algorithm parallel in C++. |
cores |
Number of cores to use for parallelism. Valid only when argument parallel is set to TRUE. Default value is 0 and it means the maximum supported cores. |
The trimmed mean is computed. The lower and upper a% of the data are removed and the mean is calculated using the rest of the data.
The trimmed mean.
Michail Tsagris and Manos Papadakis.
R implementation and documentation: Michail Tsagris [email protected] and Manos Papadakis [email protected].
Wilcox R.R. (2005). Introduction to robust estimation and hypothesis testing. Academic Press.
x <- rnorm(100, 1, 1) all.equal(trim.mean(x, 0.05),mean(x, 0.05)) x<-matrix(x,10,10) colTrimMean(x,0.05) rowTrimMean(x,0.05)
x <- rnorm(100, 1, 1) all.equal(trim.mean(x, 0.05),mean(x, 0.05)) x<-matrix(x,10,10) colTrimMean(x,0.05) rowTrimMean(x,0.05)
Variable selection using the PC-simple algorithm.
pc.sel(y, x, ystand = TRUE, xstand = TRUE, alpha = 0.05)
pc.sel(y, x, ystand = TRUE, xstand = TRUE, alpha = 0.05)
y |
A numerical vector with continuous data. |
x |
A matrix with numerical data; the independent variables, of which some will probably be selected. |
ystand |
If this is TRUE the response variable is centered. The mean is subtracted from every value. |
xstand |
If this is TRUE the independent variables are standardised. |
alpha |
The significance level. |
Variable selection for continuous data only is performed using the PC-simple algorithm (Buhlmann, Kalisch and Maathuis, 2010). The PC algorithm used to infer the skeleton of a Bayesian Network has been adopted in the context of variable selection. In other words, the PC algorithm is used for a single node.
A list including:
vars |
A vector with the selected variables. |
n.tests |
The number of tests performed. |
runtime |
The runtime of the algorithm. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Buhlmann P., Kalisch M. and Maathuis M. H. (2010). Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika, 97(2): 261-278. https://arxiv.org/pdf/0906.3204.pdf
y <- rnorm(100) x <- matrix( rnorm(100 * 50), ncol = 50) a <- pc.sel(y, x)
y <- rnorm(100) x <- matrix( rnorm(100 * 50), ncol = 50) a <- pc.sel(y, x)
Wald confidence interval for the ratio of two Poisson variables.
wald.poisrat(x, y, alpha = 0.05) col.waldpoisrat(x, y, alpha = 0.05)
wald.poisrat(x, y, alpha = 0.05) col.waldpoisrat(x, y, alpha = 0.05)
x |
A numeric vector or a matrix with count data. |
y |
A numeric vector or a matrix with count data. |
alpha |
The 1 - confidence level. The default value is 0.05. |
wald confidence interval for the ratio of two Poisson means is/are calculated.
For the wald.poisrat a vector with three elements, the ratio and the lower and upper confidence interval limits. For the col.waldpoisrat a matrix with three columns, the ratio and the lower and upper confidence interval limits.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Krishnamoorthy K., Peng J. and Zhang D. (2016). Modified large sample confidence intervals for Poisson distributions: Ratio, weighted average, and product of means. Communications in Statistics-Theory and Methods, 45(1): 83-97.
x <- rpois(100, 10) y <- rpois(100, 10) wald.poisrat(x, y)
x <- rpois(100, 10) y <- rpois(100, 10) wald.poisrat(x, y)
Walter's confidence interval for the ratio of two binomial variables (and the relative risk).
walter.ci(x1, x2, n1, n2, a = 0.05)
walter.ci(x1, x2, n1, n2, a = 0.05)
x1 |
An integer number, greater than or equal to zero. |
x2 |
A secondinteger number, greater than or equal to zero. |
n1 |
An integer number, greater than or x1. |
n2 |
A secondinteger number, greater than or equal to x2. |
a |
The significance level. The produced confidence interval has a confidence level equal to 1-a. |
This calculates a (1-a)% confidence interval for the ratio of two binomial variables (and hence for the relative risk) using Walter's suggestion (Walter, 1975). That is, to add 0.5 in each number. This not only overcomes the problem of zero values, but produces intervals that are more accurate than the classical asymptotic confidence interval (Alharbi and Tsagris, 2018).
A list including:
rat |
The ratio of the two binomial distributions. |
ci |
Walter's confidence interval. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Walter S. (1975). The distribution of Levin's measure of attributable risk. Biometrika, 62(2): 371-372.
Alharbi N. and Tsagris M. (2018). Confidence Intervals for the Relative Risk. Biostatistics and Biometrics, 4(5). doi:10.19080/BBOAJ.2018.04.555647
https://juniperpublishers.com/bboaj/pdf/BBOAJ.MS.ID.555647.pdf
x1 <- rbinom(1, 20, 0.7) x2 <- rbinom(1, 30, 0.6) n1 <- 20 n2 <- 30 walter.ci(x1,x2,n1,n2)
x1 <- rbinom(1, 20, 0.7) x2 <- rbinom(1, 30, 0.6) n1 <- 20 n2 <- 30 walter.ci(x1,x2,n1,n2)
Zero inflated Gamma regression.
zigamma.reg(y, x, full = FALSE, tol = 1e-07, maxiters = 100)
zigamma.reg(y, x, full = FALSE, tol = 1e-07, maxiters = 100)
y |
The dependent variable, a numerical vector with numbers, zeros and higher. |
x |
A numerical matrix with the indendent variables. We add, internally, the first column of ones. |
full |
If this is FALSE, the coefficients and the log-likelihood will be returned only. If this is TRUE, more information is returned. |
tol |
The tolerance value to terminate the Newton-Raphson algorithm. |
maxiters |
The maximum number of iterations that can take place in each regression. |
Two regression models are fitted, a binary logistic regression and a Gamma regression model to the non-zero responses.
Depending on whether "full" is TRUE or not different outputs are returned. In general, the regression coefficients, the iterations required by Newton-Raphson and the deviances are returned. If full is TRUE, a matrix with their standard errors and the Wald test statistics is returned as well.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Mills, Elizabeth Dastrup (2013). Adjusting for covariates in zero-inflated gamma and zero-inflated log-normal models for semicontinuous data. PhD thesis, University of Iowa.
y <- rgamma(100, 4, 1) y[sample(100, 10)] <- 0 x <- rnorm(100) a <- zigamma.reg(y, x)
y <- rgamma(100, 4, 1) y[sample(100, 10)] <- 0 x <- rnorm(100) a <- zigamma.reg(y, x)
Zero truncated Poisson regression.
ztp.reg(y, x, full = FALSE, tol = 1e-07, maxiters = 100)
ztp.reg(y, x, full = FALSE, tol = 1e-07, maxiters = 100)
y |
The dependent variable, a numerical vector with integer valued numbers. |
x |
A matrix or a data.frame with the indendent variables. |
full |
If you want full information (standard errors, Walt test statistics and p-values of the regression coefficients) set this equal to TRUE. |
tol |
The tolerance value required by the Newton-Raphson to stop. |
maxiters |
The maximum iterations allowed. |
A zero truncated poisson regression model is fitted.
A list including:
be |
The regression coefficients if "full" was set to FALSE. |
info |
This is returned only if "full" was set to TRUE. It is a matrix with the regression coefficients, their standard errors, Walt test statistics and p-values. |
loglik |
The loglikelihood of the regression model. |
iter |
The iterations required by the Newton-Raphson. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
y <- rpois(100, 5) y[y == 0] <- 1 x <- matrix( rnorm(100 * 5), ncol = 5 ) mod <- ztp.reg(y, x)
y <- rpois(100, 5) y[y == 0] <- 1 x <- matrix( rnorm(100 * 5), ncol = 5 ) mod <- ztp.reg(y, x)