Title: | Random Subspace Method (RSM) for Linear Regression |
---|---|
Description: | Performs Random Subspace Method (RSM) for high-dimensional linear regression to obtain variable importance measures. The final model is chosen based on validation set or Generalized Information Criterion. |
Authors: | Pawel Teisseyre, Robert A. Klopotek |
Maintainer: | Pawel Teisseyre <[email protected]> |
License: | LGPL-2 \| LGPL-3 \| GPL-2 \| GPL-3 |
Version: | 0.5 |
Built: | 2024-10-31 19:50:49 UTC |
Source: | CRAN |
This function produces a dot plot showing final scores from RSM procedure.
## S3 method for class 'regRSM'
ImpPlot(object)
object |
Fitted 'regRSM' model object. |
This function produces a dot plot of the final scores from the RSM procedure. The final scores measure the importance of the explanatory variables.
Pawel Teisseyre, Robert A. Klopotek.
library(regRSM)
p = 100
n = 100
# Only variables 1, 5 and 10 have non-zero coefficients
beta1 = numeric(p)
beta1[c(1,5,10)] = c(1,1,1)
x = matrix(0, ncol=p, nrow=n)
for(j in 1:p){
  x[,j] = rnorm(n,0,1)
}
y = x %*% beta1 + rnorm(n)
p1 = regRSM(x,y)
ImpPlot(p1)
This function produces a plot showing prediction errors on validation set (or the value of Generalized Information Criterion) with respect to the number of variables included in the model.
## S3 method for class 'regRSM'
plot(x, ...)
x |
Fitted 'regRSM' model object. |
... |
Other arguments to plot. |
If the Generalized Information Criterion (GIC) was used in the second step of the RSM procedure (useGIC=TRUE), the function produces a plot showing the value of GIC with respect to the number of variables included in the model. The model corresponding to the minimal value of GIC is chosen as the final one.
If GIC was not used (useGIC=FALSE) and a validation set is supplied, the function produces a plot showing prediction errors on the validation set with respect to the number of variables included in the model. The model corresponding to the minimal prediction error is chosen as the final one.
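The curve that such a plot displays can be illustrated without the package. The base R sketch below is only an illustration: it does not call regRSM, the ordering comes from absolute marginal correlations as a stand-in for the RSM final scores, and the GIC form `n * log(RSS / n) + penalty * k` with `penalty = log(n)` is an assumption made for this example rather than the formula confirmed to be used by the package.

```r
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), nrow = n)
beta1 <- numeric(p); beta1[c(1, 5, 10)] <- 1
y <- drop(x %*% beta1 + rnorm(n))

# Stand-in ordering (assumption): rank variables by |cor(x_j, y)|
ordering <- order(abs(cor(x, y)), decreasing = TRUE)

# Assumed GIC form for a Gaussian linear model with k estimated coefficients
gic <- function(rss, n, k, penalty = log(n)) n * log(rss / n) + penalty * k

thrs <- 10  # truncate the nested list of models at this size
gic_values <- sapply(seq_len(thrs), function(k) {
  fit <- lm.fit(cbind(1, x[, ordering[1:k], drop = FALSE]), y)
  gic(sum(fit$residuals^2), n, k + 1)  # k slopes plus the intercept
})

best_k <- which.min(gic_values)          # size of the selected model
final_model <- sort(ordering[1:best_k])  # indices of the selected variables
```

Plotting `gic_values` against `seq_len(thrs)` (for example with `type = "b"`) gives a curve of the kind described above, with its minimum at `best_k`.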
Pawel Teisseyre, Robert A. Klopotek.
library(regRSM)
p = 500
n = 50
beta1 = numeric(p)
beta1[c(1,5,10)] = c(1,1,1)
x = matrix(0, ncol=p, nrow=n)
xval = matrix(0, ncol=p, nrow=n)
for(j in 1:p){
  x[,j] = rnorm(n,0,1)
  xval[,j] = rnorm(n,0,1)
}
y = x %*% beta1 + rnorm(n)
yval = xval %*% beta1 + rnorm(n)
# Final model chosen by GIC (default)
p1 = regRSM(x,y)
plot(p1)
# Final model chosen on the validation set
p2 = regRSM(x,y,yval,xval,useGIC=FALSE)
plot(p2)
This function makes predictions from a 'regRSM' object.
## S3 method for class 'regRSM'
predict(object, xnew, ...)
object |
Fitted 'regRSM' model object |
xnew |
Matrix of new values for x at which predictions are to be made. |
... |
Additional arguments not used. |
Predictions are based on the final model, which is chosen using a validation set or the Generalized Information Criterion (GIC).
predict.regRSM produces a vector of predictions.
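Predicting from a selected linear model amounts to using only the selected columns of the new data. The following base R sketch illustrates this idea; it is not the package's implementation, and the names `model` and `coefs` are hypothetical stand-ins for the selected variables and the fitted coefficients.

```r
set.seed(2)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
xnew <- matrix(rnorm(5 * p), nrow = 5)
y <- drop(x[, c(1, 5)] %*% c(1, 1) + rnorm(n))

model <- c(1, 5)  # hypothetical set of selected variables
fit <- lm.fit(cbind(1, x[, model, drop = FALSE]), y)
coefs <- fit$coefficients  # intercept followed by the slopes

# Predictions use only the selected columns of xnew
pred <- drop(cbind(1, xnew[, model, drop = FALSE]) %*% coefs)
```

The result `pred` is a plain numeric vector with one entry per row of `xnew`.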
Pawel Teisseyre, Robert A. Klopotek.
library(regRSM)
p = 100
n = 100
beta1 = numeric(p)
beta1[c(1,5,10)] = c(1,1,1)
x = matrix(0, ncol=p, nrow=n)
xtest = matrix(0, ncol=p, nrow=n)
for(j in 1:p){
  x[,j] = rnorm(n,0,1)
  xtest[,j] = rnorm(n,0,1)
}
y = x %*% beta1 + rnorm(n)
p1 = regRSM(x,y)
# Predictions for the rows of xtest
predict(p1,xtest)
This function prints a summary of the RSM procedure.
## S3 method for class 'regRSM'
print(x, ...)
x |
Fitted 'regRSM' model object. |
... |
Additional arguments not used. |
The function prints information about the selection method, screening, initial weights, version (sequential or parallel), size of the random subspace, and number of simulations.
Pawel Teisseyre, Robert A. Klopotek.
library(regRSM)
p = 100
n = 100
beta1 = numeric(p)
beta1[c(1,5,10)] = c(1,1,1)
x = matrix(0, ncol=p, nrow=n)
for(j in 1:p){
  x[,j] = rnorm(n,0,1)
}
y = x %*% beta1 + rnorm(n)
p1 = regRSM(x,y)
print(p1)
Performs Random Subspace Method (RSM) for high-dimensional linear regression to obtain variable importance measures. The final model is chosen based on validation set or Generalized Information Criterion.
## S3 method for class 'formula'
regRSM(formula, data=NULL, ...)
## Default S3 method:
regRSM(x, y, yval, xval, m, B, parallel, nslaves, store_data, screening,
       init_weights, useGIC, thrs, penalty, ...)
formula |
Formula describing the model to be fitted. |
data |
Data frame containing the variables in the model |
y |
Quantitative response vector of length
x |
Input matrix with |
yval |
Optional quantitative response vector from validation set. Default is
xval |
Optional input matrix from validation set. Default is |
m |
The size of the random subspace. Default is |
B |
Number of repetitions in RSM procedure. Default is 1000. |
parallel |
This argument indicates which version should be used. Default is |
nslaves |
Number of slaves. Default is 4. |
store_data |
Logical argument indicating whether matrix |
screening |
If the screening argument is in
init_weights |
This argument indicates whether weighted version of the procedure should be used.
If the |
useGIC |
Logical argument indicating whether Generalized Information Criterion (GIC) should be used in the second step of the procedure.
Default is |
thrs |
Cut off threshold. The hierarchical list of models given by the ordering of variables is cut off at level |
penalty |
Penalty in Generalized Information Criterion (GIC). Default is |
... |
Other arguments; currently not used. |
The Random Subspace Method (RSM) is used to compute importance measures of the explanatory variables (RSM final scores). In the second step, the variables are ordered with respect to the final scores. The final model is selected from the nested list of models given by this ordering (the list is truncated at the level thrs to avoid fitting models that are close to the saturated model). By default, the final model is the one that minimizes the Generalized Information Criterion (GIC). If a validation set is supplied and useGIC=FALSE, the final model is the one that minimizes the prediction error on the validation set.
When screening and the weighted version are used together, screening is performed first and then the weighted version (WRSM) is applied to the remaining variables.
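The first step of the procedure can be sketched in a few lines of base R. This is a minimal illustration, not the package's code: it assumes that each variable's final score is the average of its squared t-statistic over the random subspaces in which it was drawn (as we understand the method from the reference cited for this package), and it omits screening, initial weights and parallel execution.

```r
set.seed(3)
n <- 100; p <- 30
x <- matrix(rnorm(n * p), nrow = n)
beta1 <- numeric(p); beta1[c(1, 5, 10)] <- 1
y <- drop(x %*% beta1 + rnorm(n))

m <- 10   # size of each random subspace
B <- 300  # number of random subspaces

score_sum <- numeric(p)  # accumulated squared t-statistics
draws <- numeric(p)      # how many times each variable was drawn

for (b in seq_len(B)) {
  idx <- sample(p, m)                 # random subset of m variables
  xs <- x[, idx, drop = FALSE]
  # t-statistics of the m slopes (first row is the intercept)
  tstat <- summary(lm(y ~ xs))$coefficients[-1, "t value"]
  score_sum[idx] <- score_sum[idx] + tstat^2
  draws[idx] <- draws[idx] + 1
}

# Average over the subspaces in which each variable appeared
scores <- ifelse(draws > 0, score_sum / draws, 0)
ordering <- order(scores, decreasing = TRUE)
head(ordering)
```

The vector `ordering` defines the nested list of models from which the second step selects the final model.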
When parallel=NO, sequential code is used for computation. Otherwise, the most costly operation in our implementation (the repetitions of the RSM procedure) is parallelized. Parallelization is very inefficient on a machine with a single processor with a single core.
In parallel mode, the regRSM function does not close the slaves it creates, because creating slaves is usually very time-consuming. The next parallel call will reuse the existing slave processes. If you want to change the number of slaves, execute mpi.close.Rslaves() (if parallel=MPI) and then call regRSM with the new value of nslaves.
When parallel=POSIX, an OpenMP-like parallel implementation is used. This parallel execution is handled by the doParallel library, which parallelizes loops. The optimal value of nslaves is the number of processor cores in the machine.
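One way to find the number of processor cores is `detectCores()` from the base `parallel` package. The snippet below is a small sketch; the commented regRSM call is hypothetical and assumes that the package is installed, that `x` and `y` are defined, and that the value of `parallel` is passed as a quoted string.

```r
library(parallel)  # ships with R

# detectCores() can return NA on some platforms, so fall back to 1
cores <- detectCores()
nslaves <- if (is.na(cores)) 1L else cores

# Hypothetical call, assuming regRSM is installed and x, y are defined:
# p1 <- regRSM(x, y, parallel = "POSIX", nslaves = nslaves)
```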
When parallel=MPI, the MPI parallel implementation is used. This parallel execution is handled by the Rmpi library. MPI (Message Passing Interface) uses messages to send job tasks from the main process (master) to the other processes (slaves). These processes do not necessarily run on one machine. In our implementation the most costly operation (the repetitions of the RSM procedure) is parallelized. The optimal value of nslaves is the total number of computing cores of all machines configured in the MPI framework. If only one machine is used, the best value is the number of its processor cores. If you no longer want to use this kind of parallel computation, remember to close the MPI framework by calling mpi.close.Rslaves().
Installing MPI for multiple machines:
The following guidelines describe how to install and configure the MPI framework on multiple machines. MPI configuration on multiple machines is straightforward. Each machine must be connected to the main machine (master). When using the Rmpi package, remember that it works a little differently from a typical C MPI application. Usually the master process transfers the whole application through MPI and replicates it on the available slots (slaves). Rmpi uses the existing R installation, so all required packages must be installed on each machine. Only R source code and data are transferred. We present the required steps for the Ubuntu operating system (we use Ubuntu 12.04 LTS). To install Open MPI on Ubuntu, type:
sudo apt-get install libopenmpi-dev openmpi-bin
On Ubuntu with Open MPI and R installed, one can simply run R and type:
install.packages("Rmpi")
Consider a case in which we have several (2 or more) machines with the Ubuntu 12.04 LTS operating system, R 3.0, Rmpi and regRSM installed, and all machines are connected to the same network. Moreover, let us assume we have one network card, which is mapped to eth0. With the command ifconfig we can check which IP addresses our machines have. For simplicity, to avoid changing the configuration, we assign a static address to each machine.
In our network we have 4 PCs, each with a 4-core processor. We give them the following names and IP addresses:
node09: 10.200.1.159
node08: 10.200.1.158
node07: 10.200.1.157
node06: 10.200.1.156
We create a text file with an IP address and the number of slots in each line. A slot is an instance of our application working in slave mode. For example, if we have the line 127.0.0.1 slots=4, then MPI should run up to 4 slave processes on our machine (localhost). If we request more slaves than slots, the node will be oversubscribed and performance can drop. We can impose a hard limit of 4 slots by changing the line to 127.0.0.1 slots=4 max_slots=4. In this case a request for more than 4 processes on this node will result in an error. When setting hard limits, remember that the total number of processes created by the Rmpi package is equal to the number of slaves plus one (the master process). For example, if we want each computer to run 4 parallel tasks, we assign 4 slots to each machine.
Example of our hostfile myhosts:
10.200.1.159 slots=4
10.200.1.157 slots=4
10.200.1.158 slots=4
10.200.1.156 slots=4
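As a quick sanity check, the total number of slots declared in such a hostfile can be counted from R. The sketch below writes the same four lines to a temporary file (a stand-in for myhosts) and sums the slots; the variable names are ours, and the simple pattern assumes that every line has the form `<address> slots=<number>`.

```r
# Temporary copy of the hostfile, used only for this illustration
hostfile <- tempfile("myhosts")
writeLines(c(
  "10.200.1.159 slots=4",
  "10.200.1.157 slots=4",
  "10.200.1.158 slots=4",
  "10.200.1.156 slots=4"
), hostfile)

lines <- readLines(hostfile)
# Extract the number that follows "slots=" on each line
slots <- as.integer(sub("^\\S+\\s+slots=(\\d+).*$", "\\1", lines, perl = TRUE))
total_slots <- sum(slots)

# Spawning one slave per slot gives total_slots slaves plus the master
total_processes <- total_slots + 1
```

With four machines and four slots each, `total_slots` is 16, so spawning 16 slaves results in 17 processes including the master.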
We run an MPI application by executing:
mpiexec -n <no_of_program_copies> -hostfile <file_with_hosts> <program_name>
The parameter -n can be misleading when working with the Rmpi package. We want to start one R instance on which we run our experiment, so this value should be set to 1.
To give Rmpi our hostfile, run the command:
mpiexec -n 1 -hostfile myhosts R --no-save
which means that we run one R process with the given hostfile for the MPI configuration. In the R terminal we type:
> library(Rmpi)
> mpi.spawn.Rslaves()
16 slaves are spawned successfully. 0 failed.
master (rank 0 , comm 1) of size 17 is running on: node09
slave1 (rank 1 , comm 1) of size 17 is running on: node09
slave2 (rank 2 , comm 1) of size 17 is running on: node09
slave3 (rank 3 , comm 1) of size 17 is running on: node09
... ... ...
slave15 (rank 15, comm 1) of size 17 is running on: node06
slave16 (rank 16, comm 1) of size 17 is running on: node09
The above lines indicate that all MPI processes are launched successfully.
scores |
RSM final scores. |
model |
The final model chosen from the list given by the ordering of variables according to the RSM scores. |
time |
Computational time. |
data_transfer |
Data transfer time. |
coefficients |
Coefficients in the selected linear model. |
input_data |
Input data |
control |
List containing information about input parameters. |
informationCriterion |
Values of Generalized Information Criterion calculated for all models from the nested list given by the ordering. |
predError |
Prediction errors on validation set calculated for all models from the nested list given by the ordering. |
Pawel Teisseyre, Robert A. Klopotek.
Mielniczuk, J., Teisseyre, P., Using random subspace method for prediction and variable importance assessment in linear regression, Computational Statistics and Data Analysis, Vol. 71, 725-742, 2014.
predict, plot, ImpPlot, validate, roc methods.
library(regRSM)
p = 500
n = 50
beta1 = numeric(p)
beta1[c(1,5,10)] = c(1,1,1)
x = matrix(0, ncol=p, nrow=n)
for(j in 1:p){
  x[,j] = rnorm(n,0,1)
}
y = x %*% beta1 + rnorm(n)
# Default (matrix) interface
p1 = regRSM(x,y)
# Formula interface
data1 = data.frame(y,x)
p2 = regRSM(y~.,data=data1)
This function produces a ROC curve and computes the AUC.
## S3 method for class 'regRSM'
roc(object, truemodel, plotit, ...)
object |
Fitted 'regRSM' model object. |
truemodel |
User specified vector containing indexes of all significant variables. |
plotit |
Logical argument indicating whether a plot should be produced. If the value is |
... |
Other arguments to plot. |
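Let $i_1,\ldots,i_p$ be the ordering of variables (e.g. given by the RSM final scores), where $p$ is the number of all variables. The ROC curve for the ordering is defined as
$$ROC(s) := (FPR(s), TPR(s)), \quad s \in \{1,\ldots,p\},$$
where
$$FPR(s) := \frac{|T^{c} \cap \{i_1,\ldots,i_s\}|}{|T^{c}|}, \qquad TPR(s) := \frac{|T \cap \{i_1,\ldots,i_s\}|}{|T|},$$
$T$ is the set of significant variables (argument truemodel), $|A|$ denotes the cardinality of $A$, and $A^{c}$ denotes the complement of $A$.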
This function is useful for evaluating the ranking produced by the RSM procedure when the set of significant variables is known (e.g. in simulation experiments on artificial datasets). An AUC equal to one means that all significant variables, supplied by the user in the argument truemodel, are placed at the top of the ranking list.
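A ROC curve for an ordering can be computed by walking down the ranking and counting, after each position, the fraction of significant variables (true positive rate) and of non-significant variables (false positive rate) seen so far. The base R sketch below illustrates this; `roc_for_ordering` is a hypothetical helper written for this example, not a function of the package, and it assumes that `truemodel` is a non-empty proper subset of the ranked variables.

```r
roc_for_ordering <- function(ordering, truemodel) {
  p <- length(ordering)
  is_true <- ordering %in% truemodel
  # Rates after the first s variables of the ranking, starting from (0, 0)
  tpr <- c(0, cumsum(is_true) / length(truemodel))
  fpr <- c(0, cumsum(!is_true) / (p - length(truemodel)))
  # Area under the curve by the trapezoidal rule
  auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
  list(fpr = fpr, tpr = tpr, auc = auc)
}

# Significant variables 3 and 1 are ranked first: AUC is 1
auc_top <- roc_for_ordering(c(3, 1, 2, 4), truemodel = c(3, 1))$auc

# A non-significant variable is ranked first: AUC drops to 0.5
auc_mixed <- roc_for_ordering(c(2, 3, 1, 4), truemodel = c(3, 1))$auc
```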
ROC curve is produced and the value of parameter AUC is returned.
Pawel Teisseyre, Robert A. Klopotek.
library(regRSM)
p = 100
n = 100
beta1 = numeric(p)
beta1[c(1,5,10)] = c(1,1,1)
x = matrix(0, ncol=p, nrow=n)
for(j in 1:p){
  x[,j] = rnorm(n,0,1)
}
y = x %*% beta1 + rnorm(n)
p1 = regRSM(x,y,store_data=TRUE)
# Indexes of the truly significant variables
true = c(1,5,10)
roc(p1,true,plotit=TRUE)
This function prints a summary of the RSM procedure.
## S3 method for class 'regRSM'
summary(object, ...)
object |
Fitted 'regRSM' model object. |
... |
Additional arguments not used. |
The function prints information about the selection method, screening, initial weights, version (sequential or parallel), size of the random subspace, and number of simulations.
Pawel Teisseyre, Robert A. Klopotek.
library(regRSM)
p = 100
n = 100
beta1 = numeric(p)
beta1[c(1,5,10)] = c(1,1,1)
x = matrix(0, ncol=p, nrow=n)
for(j in 1:p){
  x[,j] = rnorm(n,0,1)
}
y = x %*% beta1 + rnorm(n)
p1 = regRSM(x,y)
summary(p1)
This function selects the new final model based on the previously computed final scores.
## S3 method for class 'regRSM'
validate(object, yval, xval)
object |
Fitted 'regRSM' model object. |
yval |
Quantitative response vector from validation set. |
xval |
Input matrix from validation set. |
To use the function, the argument store_data in the 'regRSM' object must be TRUE. The function uses the final scores from the 'regRSM' object to create a ranking of variables. Then the final model that minimizes the prediction error on the specified validation set is chosen. An object of class 'regRSM' is returned. The final scores in the original 'regRSM' object and in the new one coincide; however, the final models can differ.
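The re-selection step can be illustrated in base R. The sketch below is not the package's code: it takes squared marginal correlations as a stand-in for the stored final scores, truncates the nested list at 10 models, and measures prediction error by the mean squared error on the validation set, all of which are assumptions made for the example.

```r
set.seed(4)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), nrow = n)
xval <- matrix(rnorm(n * p), nrow = n)
beta1 <- numeric(p); beta1[c(1, 5, 10)] <- 1
y <- drop(x %*% beta1 + rnorm(n))
yval <- drop(xval %*% beta1 + rnorm(n))

# Stand-in for previously computed final scores (assumption)
scores <- drop(cor(x, y))^2
ordering <- order(scores, decreasing = TRUE)

# Validation error of each nested model along the ranking
pred_error <- sapply(seq_len(10), function(k) {
  vars <- ordering[1:k]
  fit <- lm.fit(cbind(1, x[, vars, drop = FALSE]), y)
  pred <- cbind(1, xval[, vars, drop = FALSE]) %*% fit$coefficients
  mean((yval - pred)^2)
})

# The scores stay the same; only the selected model depends on the validation set
final_model <- sort(ordering[1:which.min(pred_error)])
```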
Object of class 'regRSM' is returned.
Pawel Teisseyre, Robert A. Klopotek.
library(regRSM)
p = 100
n = 100
beta1 = numeric(p)
beta1[c(1,5,10)] = c(1,1,1)
x = matrix(0, ncol=p, nrow=n)
xval = matrix(0, ncol=p, nrow=n)
for(j in 1:p){
  x[,j] = rnorm(n,0,1)
  xval[,j] = rnorm(n,0,1)
}
y = x %*% beta1 + rnorm(n)
yval = xval %*% beta1 + rnorm(n)
# store_data=TRUE is required by validate
p1 = regRSM(x,y,store_data=TRUE)
p2 = validate(p1,yval,xval)