--- title: "Example for msma" output: html_document: toc: true toc_float: true fig_width: 8 fig_height: 5 rmarkdown::html_vignette: vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{msma} %\VignetteEncoding{UTF-8} --- `r Sys.Date()` @Atsushi Kawaguchi The `msma` package provides functions for a matrix decomposition method incorporating sparse and supervised modeling for a multiblock multivariable data analysis. # Preparation Install package (as necessary) ```{r eval=FALSE} if(!require("msma")) install.packages("msma") ``` Load package ```{r eval=TRUE} library(msma) ``` ```{r echo=FALSE, error=FALSE, eval=FALSE} #source("../package/Ver3_nest/src.r") #predir = unlist(strsplit(getwd(), "ADNI"))[1] #source(paste(predir, "ADNI/Multimodal/program/package/Ver2/src.r", sep="")); library(mvtnorm) ``` # Getting started Simulated multiblock data (list data) by using the function `simdata`. Sample size is 50. The correlation coeficient is 0.8. The numbers of columns for response and predictor can be specified by the argument `Yps` and `Xps`, respectively. The length of vecor represents the number of blocks. That is, response has three blocks with the numbers of columns being 3, 4, and 5 and predictor has one block with the number of columns being 3. ```{r} dataset0 = simdata(n = 50, rho = 0.8, Yps = c(3, 4, 5), Xps = 3, seed=1) X0 = dataset0$X; Y0 = dataset0$Y ``` The data generated here is applied to the `msma` function. ## One Component The argument `comp` can specify the number of components. The arguments `lambdaX` and `lambdaY` can specify the regularization parameters for X and Y, respectively. First, we set `comp`=1, which will perform an analysis with 1 component. ```{r} fit01 = msma(X0, Y0, comp=1, lambdaX=0.05, lambdaY=1:3) fit01 ``` The `plot` function is available. In default setting, the block weights are displayed as a barplot. ```{r} plot(fit01) ``` ## Two Component Next, we set `comp`=2, which will perform an analysis with 2 components. ```{r} fit02 = msma(X0, Y0, comp=2, lambdaX=0.03, lambdaY=0.01*(1:3)) fit02 ``` ################################################################################################ # Single Block ################################################################################################ Two matrics are prepared by specifying arguments `Yps` and `Xps`. ```{r} dataset1 = simdata(n = 50, rho = 0.8, Yps = 5, Xps = 5, seed=1) X1 = dataset1$X[[1]]; Y1 = dataset1$Y ``` ## Principal Component Analysis (PCA) If input is a matrix, a principal component analysis is implemented. ```{r} (fit111 = msma(X1, comp=5)) ``` The weight (loading) vectors can be obtained as follows. ```{r} fit111$wbX ``` The bar plots of weight vectors are provided by the function `plot`. The component number is specified by the argument `axes`. The plot type is selected by the argument `plottype`. Furthermore, since this function uses the `barplot` function originally built into R, its arguments are also available. In the following example, on the horizontal axis, the magnification of the variable names is set to 0.7 by setting `cex.names`=0.7, and the variable names are oriented as `las`=2. ```{r} par(mfrow=c(1,2)) plot(fit111, axes = 1, plottype="bar", cex.names=0.7, las=2) plot(fit111, axes = 2, plottype="bar", cex.names=0.7, las=2) ``` The score vectors for first six subjects. ```{r} lapply(fit111$sbX, head) ``` The scatter plots for the score vectors specified by the argument `v`. The argument `axes` is specified by the two length vector represents which components are displayed. ```{r} par(mfrow=c(1,2)) plot(fit111, v="score", axes = 1:2, plottype="scatter") plot(fit111, v="score", axes = 2:3, plottype="scatter") ``` When the argument `v` was specified as "cpev", the cummulative eigenvalues are plotted. ```{r} par(mfrow=c(1,2)) plot(fit111, v="cpev", ylim=c(0.7, 1)) ``` ### Other functions in R for PCA There is the R function prcomp to implement PCA. ```{r} (fit1112 = prcomp(X1, scale=TRUE)) summary(fit1112) ``` This Rotation is almost the same as the output of `msma`, but it can be made closer by setting the argument `ceps` as follows. ```{r} fit1113 = msma(X1, comp=5, ceps=0.0000001) fit1113$wbX ``` Plotting the scores with the signs turned over, we see that similar scores are calculated. ```{r} par(mfrow=c(1,2)) biplot(fit1112) plot(-fit1113$sbX[[1]][,1:2],xlab="Component 1",ylab="Component 2") ``` The `ggfortify` package is also available for the PCA plot. ### Sparse PCA If `lambdaX` (>0) is specified, a sparse principal component analysis is implemented. ```{r} (fit112 = msma(X1, comp=5, lambdaX=0.1)) par(mfrow=c(1,2)) plot(fit112, axes = 1, plottype="bar", las=2) plot(fit112, axes = 2, plottype="bar", las=2) ``` ### Supervised Sparse PCA The outcome Z is generated. ```{r} set.seed(1); Z = rbinom(50, 1, 0.5) ``` If the outcome Z is specified, a supervised sparse principal component analysis is implemented. ```{r} (fit113 = msma(X1, Z=Z, comp=5, lambdaX=0.02)) ``` ```{r} par(mfrow=c(1,2)) plot(fit113, axes = 1, plottype="bar", las=2) plot(fit113, axes = 2, plottype="bar", las=2) ``` ## Partial Least Squres (PLS) If the another input Y1 is specified, a partial least squres is implemented. ```{r} (fit121 = msma(X1, Y1, comp=2)) ``` The component number is specified by the argument `axes`. When the argument `XY` was specified as "XY", the scatter plots for Y score against X score are plotted. ```{r} par(mfrow=c(1,2)) plot(fit121, axes = 1, XY="XY") plot(fit121, axes = 2, XY="XY") ``` ### Sparse PLS If `lambdaX` and `lambdaY` are specified, a sparse PLS is implemented. ```{r} (fit122 = msma(X1, Y1, comp=2, lambdaX=0.5, lambdaY=0.5)) par(mfrow=c(1,2)) plot(fit122, axes = 1, XY="XY") plot(fit122, axes = 2, XY="XY") ``` ### Supervised Sparse PLS If the outcome Z is specified, a supervised sparse PLS is implemented. ```{r} (fit123 = msma(X1, Y1, Z, comp=2, lambdaX=0.5, lambdaY=0.5)) par(mfrow=c(1,2)) plot(fit123, axes = 1, XY="XY") plot(fit123, axes = 2, XY="XY") ``` ################################################################################################ # Multi Block ################################################################################################ Multiblock data is a list of data matrix. ```{r} dataset2 = simdata(n = 50, rho = 0.8, Yps = c(2, 3), Xps = c(3, 4), seed=1) X2 = dataset2$X; Y2 = dataset2$Y ``` ## PCA The input class is list. ```{r} class(X2) ``` The list length is 2 for 2 blocks. ```{r} length(X2) ``` list of data matrix structure. ```{r} lapply(X2, dim) ``` The function `msma` is applied to this list X2 as follows. ```{r} (fit211 = msma(X2, comp=1)) ``` The bar plots for the block and super weights (loadings) specified the argument `block`. ```{r} par(mfrow=c(1,2)) plot(fit211, axes = 1, plottype="bar", block="block", las=2) plot(fit211, axes = 1, plottype="bar", block="super") ``` ### Sparse PCA If `lambdaX` with the length of 2 (same as the length of blocks) are specified, a multiblock sparse PCA is implemented. ```{r} (fit212 = msma(X2, comp=1, lambdaX=c(0.5, 0.5))) ``` The bar plots for the block and super weights (loadings). ```{r} par(mfrow=c(1,2)) plot(fit212, axes = 1, plottype="bar", block="block", las=2) plot(fit212, axes = 1, plottype="bar", block="super") ``` ### Supervised Sparse PCA If the outcome Z is specified, a supervised analysis is implemented. ```{r} (fit213 = msma(X2, Z=Z, comp=1, lambdaX=c(0.5, 0.5))) ``` ```{r} par(mfrow=c(1,2)) plot(fit213, axes = 1, plottype="bar", block="block", las=2) plot(fit213, axes = 1, plottype="bar", block="super") ``` ### Nested Component A vector of length 2 can be given to the `comp` argument to perform the nested component analysis, which is a method to consider multiple components even in the super component. The first element of the vector corresponds to the number of block components and the second element corresponds to the number of (nested) super components. ```{r} (fit214 = msma(X2, comp=c(2,3))) ``` In this example, there are 2 block components and 3 super components. ```{r} fit214$wbX ``` For the block weights, the number of blocks is 2 since there are two data matrices as shown as follows, and the number of rows is 3 and 4, the number of variables in each. The number of components is 2 for the first element of the vector specified by the comp argument, which is the number of columns in each matrix. ```{r} par(mfrow=c(1,2)) plot(fit214, axes = 1, axes2 = 1, plottype="bar", block="block", las=2) plot(fit214, axes = 2, axes2 = 1, plottype="bar", block="block", las=2) ``` ```{r} fit214$wsX ``` ```{r} par(mfrow=c(2,3)) for(j in 1:2) for(i in 1:3) plot(fit214, axes = j, axes2 = i, plottype="bar", block="super") ``` ## PLS If the another input (list) Y2 is specified, the partial least squared is implemented. ```{r} (fit221 = msma(X2, Y2, comp=1)) ``` ```{r} par(mfrow=c(1,2)) plot(fit221, axes = 1, plottype="bar", block="block", XY="X", las=2) plot(fit221, axes = 1, plottype="bar", block="super", XY="X") ``` ```{r} par(mfrow=c(1,2)) plot(fit221, axes = 1, plottype="bar", block="block", XY="Y", las=2) plot(fit221, axes = 1, plottype="bar", block="super", XY="Y") ``` ### Sparse PLS The regularized parameters `lambdaX` and `lambdaY` are specified vectors with same length with the length of lists X2 and Y2, respectively. ```{r} (fit222 = msma(X2, Y2, comp=1, lambdaX=c(0.5, 0.5), lambdaY=c(0.5, 0.5))) ``` ```{r} par(mfrow=c(1,2)) plot(fit222, axes = 1, plottype="bar", block="block", XY="X", las=2) plot(fit222, axes = 1, plottype="bar", block="super", XY="X") ``` ```{r} par(mfrow=c(1,2)) plot(fit222, axes = 1, plottype="bar", block="block", XY="Y", las=2) plot(fit222, axes = 1, plottype="bar", block="super", XY="Y") ``` ### Supervised Sparse PLS ```{r} (fit223 = msma(X2, Y2, Z, comp=1, lambdaX=c(0.5, 0.5), lambdaY=c(0.5, 0.5))) ``` ```{r} par(mfrow=c(1,2)) plot(fit223, axes = 1, plottype="bar", block="block", XY="X", las=2) plot(fit223, axes = 1, plottype="bar", block="super", XY="X") ``` ```{r} par(mfrow=c(1,2)) plot(fit223, axes = 1, plottype="bar", block="block", XY="Y", las=2) plot(fit223, axes = 1, plottype="bar", block="super", XY="Y") ``` ################################################################################################ # Parameter Selection ################################################################################################ ## Number of Components number of components search ### Single Block #### PCA ```{r} (ncomp11 = ncompsearch(X1, comps = c(1, 5, 10*(1:2)), nfold=5)) plot(ncomp11) ``` ```{r} (ncomp12 = ncompsearch(X1, comps = 20, criterion="BIC")) plot(ncomp12) ``` #### PLS ```{r} ncomp21 = ncompsearch(X2, Y2, comps = c(1, 5, 10*(1:2)), nfold=5) plot(ncomp21) ``` ### Multi Block The multi block structure has ```{r} dataset3 = simdata(n = 50, rho = 0.8, Yps = rep(4, 5), Xps = rep(4, 5), seed=1) X3 = dataset3$X; Y3 = dataset3$Y ``` #### PCA ```{r} ncomp31 = ncompsearch(X3, comps = 20, criterion="BIC") plot(ncomp31) ``` #### Nested ```{r} (ncomp32 = ncompsearch(X3, comps = list(20, 20), criterion="BIC")) ``` ```{r} par(mfrow=c(1,2)) plot(ncomp32,1) plot(ncomp32,2) ``` #### PLS ```{r} ncomp41 = ncompsearch(X3, Y3, comps = c(1, 5, 10*(1:2)), criterion="BIC") plot(ncomp41) ``` ## Combined Search The number of components and regularized parameters can be selected by the function `optparasearch`. The following options are available. ```{r} criteria = c("BIC", "CV") search.methods = c("regparaonly", "regpara1st", "ncomp1st", "simultaneous") ``` ### Single Block #### PCA - The `regparaonly` method searches for the regularized parameters with a fixed number of components. ```{r} (opt11 = optparasearch(X1, search.method = "regparaonly", criterion="BIC", comp=ncomp11$optncomp)) (fit311 = msma(X1, comp=opt11$optncomp, lambdaX=opt11$optlambdaX)) ``` - The `ncomp1st` method identifies the number of components with a regularized parameter of 0, then searches for the regularized parameters with the selected number of components. ```{r} (opt12 = optparasearch(X1, search.method = "ncomp1st", criterion="BIC")) (fit312 = msma(X1, comp=opt12$optncomp, lambdaX=opt12$optlambdaX)) ``` - The `regpara1st` identifies the regularized parameters by fixing the number of components, then searching for the number of components with the selected regularized parameters. ```{r} (opt13 = optparasearch(X1, search.method = "regpara1st", criterion="BIC")) (fit313 = msma(X1, comp=opt13$optncomp, lambdaX=opt13$optlambdaX)) ``` - The `simultaneous` method identifies the number of components by searching the regularized parameters in each component. ```{r} (opt14 = optparasearch(X1, search.method = "simultaneous", criterion="BIC")) (fit314 = msma(X1, comp=opt14$optncomp, lambdaX=opt14$optlambdaX)) ``` The argument `maxpct4ncomp`=0.5 means that 0.5$\lambda$ is used as the regularized parameter when the number of components is searched and where $\lambda$ is the maximum of the regularized parameters among the possible candidates. ```{r} (opt132 = optparasearch(X1, search.method = "ncomp1st", criterion="BIC", maxpct4ncomp=0.5)) (fit3132 = msma(X1, comp=opt132$optncomp, lambdaX=opt132$optlambdaX)) ``` The result with the argument `regpara1st` depends on the number of components and the default value is 10. The number of components is set as follows. ```{r} (opt133 = optparasearch(X1, search.method = "regpara1st", criterion="BIC", comp=5)) (fit3133 = msma(X1, comp=opt133$optncomp, lambdaX=opt133$optlambdaX)) ``` #### PLS For PLS, two parameters $\lambda_X$ and $\lambda_Y$ are used in arguments `lambdaX` and `lambdaY` to control sparseness for data X and Y, respectively. ```{r} (opt21 = optparasearch(X2, Y2, search.method = "regparaonly", criterion="BIC")) (fit321 = msma(X2, Y2, comp=opt21$optncomp, lambdaX=opt21$optlambdaX, lambdaY=opt21$optlambdaY)) ``` ### Multi Block #### PCA ```{r} (opt31 = optparasearch(X3, search.method = "regparaonly", criterion="BIC")) (fit331 = msma(X3, comp=opt31$optncomp, lambdaX=opt31$optlambdaX, lambdaXsup=opt31$optlambdaXsup)) ``` ```{r} (opt32 = optparasearch(X3, search.method = "regparaonly", criterion="BIC", whichselect="X")) (fit332 = msma(X3, comp=opt32$optncomp, lambdaX=opt32$optlambdaX, lambdaXsup=opt32$optlambdaXsup)) ``` ```{r} (opt33 = optparasearch(X3, search.method = "regparaonly", criterion="BIC", whichselect="Xsup")) (fit333 = msma(X3, comp=opt33$optncomp, lambdaX=opt33$optlambdaX, lambdaXsup=opt33$optlambdaXsup)) ``` #### Nested `ncomp1st` ```{r} (opt341 = optparasearch(X3, search.method = "ncomp1st", criterion="BIC", comp=c(8, 8))) (fit341 = msma(X3, comp=opt341$optncomp, lambdaX=opt341$optlambdaX, lambdaXsup=opt341$optlambdaXsup)) ``` `regparaonly` ```{r} (opt342 = optparasearch(X3, search.method = "regparaonly", criterion="BIC", comp=c(4, 5))) (fit342 = msma(X3, comp=opt342$optncomp, lambdaX=opt342$optlambdaX, lambdaXsup=opt342$optlambdaXsup)) ``` `regpara1st` ```{r} (opt344 = optparasearch(X3, search.method = "regpara1st", criterion="BIC", comp=c(8, 8))) (fit344 = msma(X3, comp=opt344$optncomp, lambdaX=opt344$optlambdaX, lambdaXsup=opt344$optlambdaXsup)) ``` This is computationally expensive and takes much longer to execute. ```{r, eval=FALSE} (opt345 = optparasearch(X3, search.method = "simultaneous", criterion="BIC", comp=c(8, 8))) (fit345 = msma(X3, comp=opt345$optncomp, lambdaX=opt345$optlambdaX)) ``` #### PLS This is computationally expensive and takes much longer to execute due to the large number of blocks. ```{r, eval=FALSE} (opt41 = optparasearch(X3, Y3, search.method = "regparaonly", criterion="BIC")) (fit341 = msma(X3, Y3, comp=opt41$optncomp, lambdaX=opt41$optlambdaX, lambdaY=opt41$optlambdaY, lambdaXsup=opt41$optlambdaXsup, lambdaYsup=opt41$optlambdaYsup)) ``` In this example, it works by narrowing down the parameters as follows. ```{r} (opt42 = optparasearch(X3, Y3, search.method = "regparaonly", criterion="BIC", whichselect=c("Xsup","Ysup"))) (fit342 = msma(X3, Y3, comp=opt42$optncomp, lambdaX=opt42$optlambdaX, lambdaY=opt42$optlambdaY, lambdaXsup=opt42$optlambdaXsup, lambdaYsup=opt42$optlambdaYsup)) ``` Another example dataset is generated. ```{r} dataset4 = simdata(n = 50, rho = 0.8, Yps = rep(4, 2), Xps = rep(4, 3), seed=1) X4 = dataset4$X; Y4 = dataset4$Y ``` With this number of blocks, the calculation can be performed in a relatively short time. ```{r} (opt43 = optparasearch(X4, Y4, search.method = "regparaonly", criterion="BIC")) (fit343 = msma(X4, Y4, comp=opt43$optncomp, lambdaX=opt43$optlambdaX, lambdaY=opt43$optlambdaY, lambdaXsup=opt43$optlambdaXsup, lambdaYsup=opt43$optlambdaYsup)) ```