Title: | Dimension Reduction with Dynamic CUR |
---|---|
Description: | Dynamic CUR (dCUR) boosts the CUR decomposition (Mahoney MW., Drineas P. (2009) <doi:10.1073/pnas.0803205106>) by varying k, the number of columns, and the number of rows used; its ultimate purpose is to help find the stage that minimizes the relative error when reducing matrix dimension. The goal of CUR decomposition is to give a better interpretation of the matrix decomposition by employing proper variable selection in the data matrix, in a way that yields a simplified structure. Its origins come from analysis in genetics. The goal of this package is to offer an alternative way to select variables (columns) or individuals (rows). The proposed idea consists of fitting probability distributions to the leverage scores and selecting the best columns and rows that minimize the reconstruction error of the matrix approximation ||A-CUR||. It also includes a method that recalibrates the relative importance of the leverage scores according to an external variable of the user's interest. |
Authors: | Cesar Gamboa-Sanabria [aut, mdc, cph, cre] , Stefani Matarrita-Munoz [aut] , Katherine Barquero-Mejias [aut] , Greibin Villegas-Barahona [aut] , Mercedes Sanchez-Barba [aqt] , Maria Purificacion Galindo-Villardon [aqt] |
Maintainer: | Cesar Gamboa <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1 |
Built: | 2024-12-11 06:59:35 UTC |
Source: | CRAN |
Data from a Ph.D. thesis about Academic Achievement Score Projection, with 632 rows and 205 columns.
data("AASP")
Further information about variables can be found in this link:
https://github.com/cgamboasanabria/dCUR/tree/master/inst/AASP_description_data.xlsx
data(AASP)
dim(AASP)
This function computes the canonical CUR decomposition using top scores as selection criteria to identify the most relevant columns and rows of a given data matrix. It also provides an option to use an extension of CUR decomposition, which reconfigures the leverage scores by using the partial or semipartial correlations with an external variable of interest. Additionally, this function lets the user fit a probability distribution to the leverage scores with Gaussian mixture models.
CUR(
  data,
  variables,
  k = NULL,
  rows,
  columns,
  standardize = FALSE,
  cur_method = "sample_cur",
  correlation = NULL,
  correlation_type = c("partial", "semipartial"),
  ...
)
data |
a data frame containing the variables to be used in CUR decomposition and any external variables with which you want to correlate them. |
variables |
correspond to the variables used to compute the leverage scores in CUR analysis. The external variable’s names must not be included. dplyr package notation can be used to specify the variables (see the example). |
k |
corresponds to the number of principal components used to compute the leverage scores. If NULL, k is set to the number of principal components that accumulate 80% of the explained variability. This argument can also be a proportion, in which case the function takes this value as the desired cumulative explained variance and chooses k automatically. |
rows |
correspond to the proportion of rows to be selected from the total number of rows in the data matrix. When all the rows are needed and |
columns |
correspond to the proportion of columns (variables) to be selected from the total number of variables in the data matrix. |
standardize |
If |
cur_method |
character. If |
correlation |
character. It specifies the name of the external variable the computation of leverage must be adjusted with. |
correlation_type |
character. It specifies if the computation of leverage must be adjusted by the |
... |
Extension of classic CUR decomposition with top scores selection criteria.
CUR decomposition chooses columns and rows that exhibit high leverage scores and exert a disproportionately large “influence” on the best low-rank fit of the data matrix. The main advantage of CUR Decomposition over SVD is that the original data matrix can be expressed as a reduced number of rows and columns instead of obtaining factorial axes resulting from a linear combination of all the original variables to facilitate interpretation.
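As an illustration of the leverage scores mentioned above, the following is a minimal base-R sketch, not the package's internal code. It assumes the normalized statistical leverage of Mahoney & Drineas (2009), that is, the mean of the squared entries of the top-k right (or left) singular vectors. The built-in `iris` data and the names `col_lev` and `row_lev` are chosen only for this example.

```r
# Toy numeric matrix: the four measurement columns of iris, standardized.
A <- scale(as.matrix(iris[, 1:4]))

k <- 2          # number of singular vectors used for the leverage scores
sv <- svd(A)    # A = U D V'

# Column leverage: mean squared loading of each column on the top-k
# right singular vectors (assumed definition, see lead-in).
col_lev <- rowSums(sv$v[, seq_len(k), drop = FALSE]^2) / k
names(col_lev) <- colnames(A)

# Row leverage: same idea with the left singular vectors.
row_lev <- rowSums(sv$u[, seq_len(k), drop = FALSE]^2) / k

# Columns ordered from most to least influential on the rank-k fit.
sort(col_lev, decreasing = TRUE)
```

Because the singular vectors are orthonormal, each set of scores sums to one, so it can be read as a probability distribution over columns or rows.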
The leverage scores can be reconfigured following the methodology of Villegas et al. (2018), which divides each leverage score by a term that depends on rho, where rho is the partial or semipartial correlation between the variables used in the CUR decomposition and an external variable. The purpose is to recalibrate the relative importance of the leverage scores according to an external variable of interest.
The correlation type can be partial or semipartial, as implemented in the R package ppcor (Seongho, 2015).
k |
Number of principal components with which the leverage scores are computed. |
CUR |
CUR matrix. |
absolute_error |
Absolute error computed as the Frobenius norm of the difference between the original data (denoted as A) and the CUR matrix: ||A-CUR|| |
relative_error |
Relative error |
leverage_columns_sorted |
a data frame with the names of the relevant columns and their leverage scores, in descending order. |
leverage_rows_sorted |
a data frame with the numbers of the relevant rows and their leverage scores, in descending order. |
leverage_columns |
a data frame with the names of all columns and their leverage scores. |
leverage_rows |
a data frame with the numbers of all rows and their leverage scores. |
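To make the absolute_error and relative_error values concrete, here is a hedged base-R sketch of a CUR reconstruction with top-score selection. It is not the package's implementation. It assumes U is built from Moore-Penrose pseudo-inverses (via `MASS::ginv`) and that the relative error is the Frobenius norm of the residual divided by the Frobenius norm of A. The data, the value of k, and the numbers of rows and columns kept are arbitrary choices for the example.

```r
library(MASS)  # ginv(): Moore-Penrose pseudo-inverse

A <- scale(as.matrix(iris[, 1:4]))
k <- 2
sv <- svd(A)

# Leverage scores from the top-k singular vectors (assumed definition).
col_lev <- rowSums(sv$v[, seq_len(k), drop = FALSE]^2) / k
row_lev <- rowSums(sv$u[, seq_len(k), drop = FALSE]^2) / k

# Keep the 2 columns and 30 rows with the highest leverage.
cols <- order(col_lev, decreasing = TRUE)[1:2]
rows <- order(row_lev, decreasing = TRUE)[1:30]

C <- A[, cols, drop = FALSE]      # actual columns of A
R <- A[rows, , drop = FALSE]      # actual rows of A
U <- ginv(C) %*% A %*% ginv(R)    # linking matrix (assumed construction)
A_hat <- C %*% U %*% R            # CUR approximation of A

absolute_error <- norm(A - A_hat, type = "F")
relative_error <- absolute_error / norm(A, type = "F")
c(absolute = absolute_error, relative = relative_error)
```

With this construction the approximation is a projection of A onto the selected columns and rows, so the relative error lies between 0 and 1.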
Cesar Gamboa-Sanabria, Stefany Matarrita-Munoz, Katherine Barquero-Mejias, Greibin Villegas-Barahona, Mercedes Sanchez-Barba and Maria Purificacion Galindo-Villardon.
Mahoney MW, Drineas P (2009). “CUR matrix decompositions for improved data analysis.” Proceedings of the National Academy of Sciences, 106(3), 697–702. ISSN 0027-8424, doi:10.1073/pnas.0803205106.
Villegas G, others (2018). “Modelo estadistico pedagogico para la toma de decisiones administrativas y academicas con impacto en el mejoramiento continuo del rendimiento de los estudiantes universitarios, basado en los metodos de seleccion CUR.” doi:10.14201/gredos.139405.
Villegas G, Martin-Barreiro C, Gonzalez-Garcia N, Hernandez-Gonzalez S, Sanchez-Barba M, Galindo-Villardon M (2019). “Dynamic CUR, an alternative to variable selection in CUR decomposition.” Revista Investigacion Operacional, 40(3), 391–399.
Drineas P, Mahoney MW, Muthukrishnan S (2008). “Relative-error CUR matrix decompositions.” SIAM Journal on Matrix Analysis and Applications, 30(2), 844–881. https://doi.org/10.1137/07070471X.
# Classic CUR with top scores selection criteria.
result <- CUR(data = AASP, variables = hoessem:notabachillerato,
              k = 20, rows = 1, columns = .2, standardize = TRUE,
              cur_method = "sample_cur")
result

# Extension of classic CUR: recalibrating leverage scores
# and fitting a Gaussian mixture model to the leverages.
result <- CUR(data = AASP, variables = hoessem:notabachillerato,
              k = 20, rows = 1, columns = .2, standardize = TRUE,
              cur_method = "mixture",
              correlation = R1, correlation_type = "partial")
result
Dynamic CUR is a function that boosts the CUR decomposition by varying k, the number of columns, and the number of rows used. Its ultimate purpose is to find the stage that minimizes the relative error. The classic CUR and its extensions can be used in dCUR.
Dynamic CUR is an R package that boosts the CUR decomposition by varying k, the number of columns, and the number of rows used; its ultimate purpose is to help find the stage that minimizes the relative error when reducing matrix dimension. Mahoney & Drineas (2009) identified the difficulty of interpreting the singular vectors of the SVD, and hence the principal components, and proposed another type of matrix factorization known as CUR decomposition (Mahoney & Drineas, 2009; Mahoney, Maggioni, & Drineas, 2008; Bodor, Csabai, Mahoney, & Solymosi, 2012). The goal of CUR decomposition is to give a better interpretation of the matrix decomposition by employing proper variable selection in the data matrix, in a way that yields a simplified structure. Its origins come from analysis in genetics. One example is shown in Mahoney & Drineas (2009), in which CUR was used to recognize, based on 5000 variables, genetic patterns in patients with soft tissue tumors analyzed with cDNA microarrays. The objective of this package is to offer an alternative for selecting variables (columns) or individuals (rows) to the ones developed by Mahoney & Drineas (2009). The proposed idea consists of fitting probability distributions to the leverage scores and selecting the best columns and rows that minimize the reconstruction error of the matrix approximation ||A-CUR||. It also includes a method that recalibrates the relative importance of the leverage scores according to an external variable of the user's interest.
dCUR(
  data,
  variables,
  standardize = FALSE,
  dynamic_columns = FALSE,
  dynamic_rows = FALSE,
  parallelize = FALSE,
  skip = 0.05,
  ...
)
data |
a data frame that contains the variables to use in CUR decomposition and any external variables with which you want to correlate them. |
variables |
correspond to the variables used to compute the leverage scores in CUR analysis. The external variable’s names must not be included. dplyr package notation can be used to specify the variables (see examples). |
standardize |
logical. If |
dynamic_columns |
logical. If |
dynamic_rows |
logical. If |
parallelize |
logical. If
skip |
numeric. It specifies the change ratio of columns and rows to be selected. |
... |
additional arguments to be passed to |
This function serves as a basis for selecting the best combination of k (principal components), c (number of columns), and r (number of rows); in other words, the stage that minimizes the relative error. It thus optimizes the number of columns in the analysis, ensuring a percentage of explained variability of the data matrix and facilitating the interpretation of the data set by reducing the dimensionality of the original matrix.
If skip = 0.1, each k is tested with a column proportion of 0, 0.1, 0.11, 0.22, ...; the same applies to rows. It is therefore recommended not to choose a tiny skip, since this implies running the CUR analysis for more stages.
Parallelizing the function improves its speed significantly.
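The search over stages described above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the dCUR implementation: the helper `cur_relative_error` is hypothetical, the proportions are taken on a plain `seq(skip, 1, by = skip)` grid (which may differ from the grid the package builds), counts are rounded up with `ceiling`, and the relative error is assumed to be the Frobenius norm of the residual divided by that of A.

```r
library(MASS)  # ginv(): Moore-Penrose pseudo-inverse

# Hypothetical helper: relative error of a top-score CUR approximation
# for a given k and given numbers of selected columns and rows.
cur_relative_error <- function(A, k, n_cols, n_rows) {
  sv <- svd(A)
  col_lev <- rowSums(sv$v[, seq_len(k), drop = FALSE]^2) / k
  row_lev <- rowSums(sv$u[, seq_len(k), drop = FALSE]^2) / k
  cols <- order(col_lev, decreasing = TRUE)[seq_len(n_cols)]
  rows <- order(row_lev, decreasing = TRUE)[seq_len(n_rows)]
  C <- A[, cols, drop = FALSE]
  R <- A[rows, , drop = FALSE]
  A_hat <- C %*% ginv(C) %*% A %*% ginv(R) %*% R
  norm(A - A_hat, type = "F") / norm(A, type = "F")
}

A <- scale(as.matrix(mtcars))   # 32 rows, 11 columns
skip <- 0.25                    # coarse grid keeps the number of stages small
props <- seq(skip, 1, by = skip)

# One stage per combination of k, column proportion and row proportion.
stages <- expand.grid(k = 1:3, col_prop = props, row_prop = props)
stages$columns <- ceiling(stages$col_prop * ncol(A))
stages$rows    <- ceiling(stages$row_prop * nrow(A))
stages$relative_error <- mapply(
  cur_relative_error,
  k = stages$k, n_cols = stages$columns, n_rows = stages$rows,
  MoreArgs = list(A = A)
)

# Stages ordered from lowest to highest relative error.
head(stages[order(stages$relative_error), ])
```

Halving skip roughly quadruples the number of stages when both rows and columns vary, which is why a tiny skip is costly.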
dCUR returns a list of lists; each one represents a stage and contains:
k |
Number of principal components with which the leverage scores are computed. |
columns |
number of columns selected. |
rows |
number of rows selected. |
relative_error |
relative_error obtained: |
Cesar Gamboa-Sanabria, Stefany Matarrita-Munoz, Katherine Barquero-Mejias, Greibin Villegas-Barahona, Mercedes Sanchez-Barba and Maria Purificacion Galindo-Villardon.
Cesar Gamboa-Sanabria [email protected]
results <- dCUR::dCUR(data = AASP, variables = hoessem:notabachillerato,
                      k = 15, rows = 0.25, columns = 0.25, skip = 0.1,
                      standardize = TRUE, cur_method = "sample_cur",
                      parallelize = TRUE,
                      dynamic_columns = TRUE, dynamic_rows = TRUE)
results
This function returns different plots associated with the fitting of leverage scores through Gaussian mixture models.
mixture_plots(data)
data |
An object resulting from a call to CUR when "mixture" is specified as cur_method. |
Gaussian Mixture Models Plots
mixture_plots returns a list with the following plots:
BIC |
Plot of the Bayesian Information Criterion (BIC) for each number of mixture components. E and V stand for equal variance and variable variance across mixture components, respectively. |
density |
Density of the leverage scores. |
Cumulative |
Cumulative density of the leverage scores. |
QQPlot |
Plot of the sample quantiles and controlled quantiles of the inverse of the cumulative distribution function. |
Cesar Gamboa-Sanabria, Stefany Matarrita-Munoz, Katherine Barquero-Mejias, Greibin Villegas-Barahona, Mercedes Sanchez-Barba and Maria Purificacion Galindo-Villardon.
Mahoney MW, Drineas P (2009). “CUR matrix decompositions for improved data analysis.” Proceedings of the National Academy of Sciences, 106(3), 697–702. ISSN 0027-8424, doi:10.1073/pnas.0803205106.
Villegas G, others (2018). “Modelo estadistico pedagogico para la toma de decisiones administrativas y academicas con impacto en el mejoramiento continuo del rendimiento de los estudiantes universitarios, basado en los metodos de seleccion CUR.” doi:10.14201/gredos.139405.
Villegas G, Martin-Barreiro C, Gonzalez-Garcia N, Hernandez-Gonzalez S, Sanchez-Barba M, Galindo-Villardon M (2019). “Dynamic CUR, an alternative to variable selection in CUR decomposition.” Revista Investigacion Operacional, 40(3), 391–399.
results <- CUR(data = AASP, variables = hoessem:notabachillerato,
               k = 20, rows = .9999999, columns = .10, standardize = TRUE,
               cur_method = "mixture")
mixture_plots(results)
optimal_stage is a function used to select the optimal k, number of columns, and number of rows of a dynamic CUR object; it also produces a data frame and the corresponding plots.
optimal_stage(data, limit = 80)
data |
An object resulting from a call to dCUR. |
limit |
Cumulative percentage average of relative error rate. |
Select the optimal stage of dynamic CUR decomposition
The objective of CUR decomposition is to find the most relevant variables and observations within a data matrix to reduce the dimensionality. It is well known that as more columns (variables) and rows are selected, the relative error will decrease; however, this is not true for k (number of components to compute leverages). Given the above, this function seeks to find the best-balanced stage of k, the number of relevant columns, and rows that have an error very close to the minimum, but at the same time maintain the low-rank fit of the data matrix.
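The idea of a stage whose error is very close to the minimum while keeping the selection small can be sketched as below. This is a simple illustrative heuristic and not the rule used by optimal_stage (in particular it does not use the limit argument). The stage table contains made-up toy values, and `tolerance` is a hypothetical parameter introduced only for this example.

```r
# Toy stage table: hypothetical relative errors, for illustration only.
stages <- data.frame(
  k              = c(1, 1, 2, 2, 3, 3),
  columns        = c(2, 4, 2, 4, 2, 4),
  relative_error = c(0.60, 0.32, 0.55, 0.30, 0.58, 0.31)
)

tolerance <- 0.05   # accept stages within 5% of the minimum error

# Stages whose error is close enough to the best one observed.
best <- min(stages$relative_error)
near_best <- stages[stages$relative_error <= best * (1 + tolerance), ]

# Among those, prefer fewer columns first, then a smaller k.
chosen <- near_best[order(near_best$columns, near_best$k), ][1, ]
chosen
```

Relaxing `tolerance` trades a slightly larger error for a smaller selection, which is the balance the paragraph above describes.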
data |
a data frame which specifies the relative error for each stage of CUR decomposition. |
rows_plot |
a plot where the average relative error is shown for each number of relevant rows selected. |
columns_plot |
a plot where the average relative error is shown for each number of relevant columns selected. |
k_plot |
a plot where the average relative error is shown for each k (number of components to compute leverage), given the optimal number of relevant columns and rows. |
optimal |
a data frame where the average relative error is shown for optimal k (number of components to compute leverage), given the optimal number of relevant columns and rows. |
Cesar Gamboa-Sanabria, Stefany Matarrita-Munoz, Katherine Barquero-Mejias, Greibin Villegas-Barahona, Mercedes Sanchez-Barba and Maria Purificacion Galindo-Villardon.
Villegas G, Martin-Barreiro C, Gonzalez-Garcia N, Hernandez-Gonzalez S, Sanchez-Barba M, Galindo-Villardon M (2019). “Dynamic CUR, an alternative to variable selection in CUR decomposition.” Revista Investigacion Operacional, 40(3), 391–399.
results <- dCUR(data = AASP, variables = hoessem:notabachillerato,
                k = 15, rows = 0.25, columns = 0.25, skip = 0.1,
                standardize = TRUE, cur_method = "sample_cur",
                parallelize = TRUE,
                dynamic_columns = TRUE, dynamic_rows = TRUE)
result <- optimal_stage(results, limit = 80)
result
result$k_plot
result$columns_plot
result$data
result$optimal
relevant_variables_plot returns a bar graph that shows the leverages of the most relevant variables of the data matrix according to the CUR decomposition.
relevant_variables_plot(data)
data |
An object resulting from a call to CUR. |
Relevant Variables Plot
Cesar Gamboa-Sanabria, Stefany Matarrita-Munoz, Katherine Barquero-Mejias, Greibin Villegas-Barahona, Mercedes Sanchez-Barba and Maria Purificacion Galindo-Villardon.
Villegas G, Martin-Barreiro C, Gonzalez-Garcia N, Hernandez-Gonzalez S, Sanchez-Barba M, Galindo-Villardon M (2019). “Dynamic CUR, an alternative to variable selection in CUR decomposition.” Revista Investigacion Operacional, 40(3), 391–399.
result <- CUR(data = AASP, variables = hoessem:notabachillerato,
              k = 20, rows = 1, columns = .2, standardize = TRUE,
              cur_method = "sample_cur")
relevant_variables_plot(result)
var_exp is used to compute the proportion of variance explained by each component of a principal component analysis.
var_exp(data, standardize = FALSE, ...)
data |
a data frame that contains the variables to be used in CUR decomposition. |
standardize |
logical. If |
... |
Additional arguments to be passed to |
The objective of CUR decomposition is to find the most relevant variables and observations within a data matrix and to reduce the dimensionality. It is well known that as more columns (variables) and rows are selected, the relative error will be lower; however, this is not true for k (number of components to calculate leverages). Given the above, this function seeks to find the best-balanced scenario of k, the number of relevant columns, and rows that have an error very close to the minimum, and that, in turn, uses a smaller amount of information.
var_exp |
a data frame with the proportion of explained variance for each principal component. |
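For readers who want to see what such a table looks like, here is a minimal sketch built with `prcomp` from base R's stats package. Whether var_exp uses `prcomp` internally is an assumption, and the `iris` data and the column names below are chosen only for this example. The last line shows how a k accumulating at least 80% of the variance, the default described for CUR, could be read off the table.

```r
# Standardized PCA on a toy numeric matrix.
A <- as.matrix(iris[, 1:4])
pca <- prcomp(A, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component.
var_explained <- data.frame(
  component  = seq_along(pca$sdev),
  proportion = pca$sdev^2 / sum(pca$sdev^2)
)
var_explained$cumulative <- cumsum(var_explained$proportion)
var_explained

# Smallest k whose cumulative explained variance reaches 80%.
k <- which(var_explained$cumulative >= 0.80)[1]
k
```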
Cesar Gamboa-Sanabria, Stefany Matarrita-Munoz, Katherine Barquero-Mejias, Greibin Villegas-Barahona, Mercedes Sanchez-Barba and Maria Purificacion Galindo-Villardon.
Mahoney MW, Drineas P (2009). “CUR matrix decompositions for improved data analysis.” Proceedings of the National Academy of Sciences, 106(3), 697–702. ISSN 0027-8424, doi:10.1073/pnas.0803205106.
Villegas G, others (2018). “Modelo estadistico pedagogico para la toma de decisiones administrativas y academicas con impacto en el mejoramiento continuo del rendimiento de los estudiantes universitarios, basado en los metodos de seleccion CUR.” doi:10.14201/gredos.139405.
Villegas G, Martin-Barreiro C, Gonzalez-Garcia N, Hernandez-Gonzalez S, Sanchez-Barba M, Galindo-Villardon M (2019). “Dynamic CUR, an alternative to variable selection in CUR decomposition.” Revista Investigacion Operacional, 40(3), 391–399.
var_exp(AASP, standardize = TRUE, hoessem:notabachillerato)