Title: | The 'epilogi' Variable Selection Algorithm for Continuous Data |
---|---|
Description: | The 'epilogi' variable selection algorithm is implemented for the case of continuous response and predictor variables. The relevant paper is: Lakiotaki K., Papadovasilakis Z., Lagani V., Fafalios S., Charonyktakis P., Tsagris M. and Tsamardinos I. (2023). "Automated machine learning for Genome Wide Association Studies". Bioinformatics. <doi:10.1093/bioinformatics/btad545>. |
Authors: | Michail Tsagris [aut, cre] |
Maintainer: | Michail Tsagris <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1 |
Built: | 2024-11-10 06:16:25 UTC |
Source: | CRAN |
The 'epilogi' Variable Selection Algorithm for Continuous Data.
Package: | epilogi |
Type: | Package |
Version: | 1.1 |
Date: | 2024-09-10 |
License: | GPL-2 |
Michail Tsagris [email protected].
Lakiotaki K., Papadovasilakis Z., Lagani V., Fafalios S., Charonyktakis P., Tsagris M. and Tsamardinos I. (2023). Automated machine learning for Genome Wide Association Studies. Bioinformatics.
Tsagris M., Papadovasilakis Z., Lakiotaki K. and Tsamardinos I. (2022). The γ-OMP algorithm for feature selection with application to gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(2): 1214–1224.
The epilogi Variable Selection Algorithm for Continuous Data.
epilogi(y, x, tol = 0.01, alpha = 0.05)
y |
A vector with the continuous response variable. |
x |
A matrix with the continuous predictor variables. |
tol |
The tolerance value for the algorithm to terminate. This takes values greater than 0 and refers to the change between two successive adjusted R-squared values. |
alpha |
The significance level at which a predictor variable is deemed statistically equivalent to a selected variable. |
The epilogi variable selection algorithm (Lakiotaki et al., 2023) is a generalisation of the γ-OMP algorithm (Tsagris et al., 2022). It applies the aforementioned algorithm with the addition that it returns the possible statistically equivalent predictor(s) for each selected predictor. Once a variable is selected, the algorithm searches for possible equivalent predictors using the partial correlation between the residuals.
The heuristic method to consider two predictors R and C informationally equivalent, given the currently selected predictor S, is determined as follows: first, the residuals r of the model using S are computed. Then, R and C are considered equivalent if the following two conditions hold: Ind(R; r | C) and Ind(r; C | R), where Ind(R; r | C) denotes the conditional independence of R with r given C. When linearity is assumed, the test can be implemented by testing the corresponding partial correlation for significance. Each Ind test returns a p-value, and independence is accepted when it is larger than a threshold (the significance level, argument alpha). Intuitively, R and C are heuristically considered equivalent because, if C is known, then R provides no additional information about the residuals r, and, if R is known, then C provides no additional information about r.
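The two-sided check described above can be sketched in base R. This is a hedged illustration, not the package's internal code: the helper names `pcor.test` and `equiv.check` are hypothetical, and the p-value uses the standard Fisher z-transform for a partial correlation with one conditioning variable.

```r
# Hypothetical sketch of the equivalence heuristic, assuming linearity.
# pcor.test(): p-value for the partial correlation of a and b given cond.
pcor.test <- function(a, b, cond) {
  ra <- resid(lm(a ~ cond))   # remove the effect of cond from a
  rb <- resid(lm(b ~ cond))   # remove the effect of cond from b
  pr <- cor(ra, rb)           # partial correlation of a and b given cond
  n  <- length(a)
  z  <- atanh(pr) * sqrt(n - 4)  # Fisher z; one conditioning variable
  2 * pnorm(-abs(z))             # two-sided p-value
}

# equiv.check(): R and C are deemed equivalent w.r.t. the residuals r if
# both Ind(R; r | C) and Ind(r; C | R) are accepted at level alpha.
equiv.check <- function(r, R, C, alpha = 0.05) {
  p1 <- pcor.test(R, r, C)
  p2 <- pcor.test(r, C, R)
  (p1 > alpha) && (p2 > alpha)
}
```

A near-copy of R (R plus tiny noise) will typically pass both tests, while an unrelated predictor fails the first one, because R still carries information about r once the unrelated predictor is conditioned on.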
A list including:
runtime |
The runtime of the algorithm. |
result |
A matrix with two columns: the selected predictor(s) and the associated adjusted R-squared values. |
equiv |
A list with the equivalent predictors (if any) corresponding to each selected predictor. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
# simulate a dataset with continuous data
set.seed(1234)
n <- 500
x <- matrix( rnorm(n * 50, 0, 30), ncol = 50 )
# define a simulated class variable
y <- 2 * x[, 1] - 1.5 * x[, 2] + x[, 3] + rnorm(n, 0, 15)
# define some simulated equivalences
x[, 4] <- x[, 1] + rnorm(n, 0, 1)
x[, 5] <- x[, 2] + rnorm(n, 0, 1)
epilogi(y, x, tol = 0.05)
Equivalence test using partial correlation.
pcor.equiv(res, y, x, alpha = 0.05)
res |
A vector with the residuals of the linear model. |
y |
A vector with a selected predictor. |
x |
A matrix with other predictors. |
alpha |
The significance level to check for predictors from x that are equivalent to y. |
A vector of 0s and 1s: 1 indicates that the corresponding predictor of x is equivalent to y, while 0 indicates that it is not.
# simulate a dataset with continuous data
set.seed(1234)
n <- 500
x <- matrix( rnorm(n * 50, 0, 30), ncol = 50 )
# define a simulated class variable
y <- 2 * x[, 1] - 1.5 * x[, 2] + x[, 3] + rnorm(n, 0, 15)
# define some simulated equivalences
x[, 4] <- x[, 1] + rnorm(n, 0, 1)
x[, 5] <- x[, 2] + rnorm(n, 0, 1)
b <- epilogi(y, x, tol = 0.05)
sel <- b$result[2, 1]
## standardise the y and x first
y <- (y - mean(y)) / Rfast::Var(y, std = TRUE)
x <- Rfast::standardise(x)
res <- resid( lm(y ~ x[, sel]) )
sela <- b$result[2:3, 1]
pcor.equiv(res, x[, sela[2]], x[, -sela])
## bear in mind that this gives the third variable after removing the first two,
## so this is essentially the 5th variable in the "x" matrix.
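The closing remark of the example above (position i of the output refers to the i-th column of x[, -sela], not to column i of x) can be made explicit with a small helper. This is a hedged sketch; `map.back` is a hypothetical name, not part of the package.

```r
# Hypothetical helper: map the 0/1 flags returned over x[, -removed]
# back to column indices of the original predictor matrix x.
map.back <- function(flags, removed, p) {
  # flags:   0/1 vector over the columns of x[, -removed]
  # removed: indices of the columns dropped before calling pcor.equiv()
  # p:       total number of columns in x
  remaining <- setdiff(seq_len(p), removed)  # surviving column indices, in order
  remaining[flags == 1]                      # original indices flagged equivalent
}
```

For instance, with p = 5 columns and columns 1 and 2 removed, a 1 in the third position of the flags vector maps back to column 5 of x, matching the comment in the example.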