PPLasso package

Introduction

This package provides functions for implementing the PPLasso (Prognostic Predictive Lasso) approach described in [1] to identify prognostic and predictive biomarkers in high dimensional settings. This method is designed by taking into account the correlations that may exist between the biomarkers. It consists in rewriting the initial high-dimensional linear model to remove the correlation existing between the predictors and in applying the generalized Lasso criterion. We refer the reader to the paper for further details.

We suppose that the response variable y satisfy the following linear model: where $$ \boldsymbol{X}=\begin{bmatrix} 1 & 0 & X_{11}^{1} & X_{11}^{2} & \ldots & X_{11}^{p} & 0 & 0 & \ldots & 0 \\ 1 & 0 & X_{12}^{1} & X_{12}^{2} & \ldots & X_{12}^{p} & 0 & 0 & \ldots & 0 \\ \vdots & \vdots & \vdots& \vdots& & \vdots & & & \\ 1 & 0 & X_{1n_{1}}^{1} & X_{1n_{1}}^{2} & \ldots & X_{1n_{1}}^{p} & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & 0 & \ldots & 0 & X_{21}^{1} & X_{21}^{2} & \ldots & X_{21}^{p} \\ 0 & 1 & 0 & 0 & \ldots & 0 & X_{22}^{1} & X_{22}^{2} & \ldots & X_{22}^{p} \\ \vdots & \vdots & \vdots& \vdots& & \vdots& \vdots&\vdots & &\vdots\\ 0 & 1 & 0 & 0 & \ldots & 0 & X_{2n_{2}}^{1} & X_{2n_{2}}^{2} & \ldots & X_{2n_{2}}^{p} \end{bmatrix}, $$ with γ = (α1, α2, β1′, β2′)′. α1 (resp. α2) corresponding to the effects of treatment t1 (resp. t2). Moreover, β1 = (β11, β12, …, β1p)′ (resp. β2 = (β21, β22, …, β2p)′) are the coefficients associated to each of the p biomarkers in treatment t1 (resp. t2) group. When t1 stands for the standard treatment (placebo), prognostic (resp. predictive) biomarkers are defined as those having non-zero coefficients in β1 (resp. in β1 − β2) and non prognostic (resp. non predictive) biomarkers correspond to the indices having null coefficients in β1 (resp. in β1 − β2). The vector β1 and β2 − β1 are assumed to be sparse, a majority of its components is equal to zero. The goal of the PPLasso approach is to retrieve the indices of the nonzero components of β1 and β2 − β1.

Concerning the biomarkers, are the design matrices for t1 and t2 groups, respectively. The rows of X1 and X2 are assumed to be the realizations of independent centered Gaussian random vectors having a covariance matrix equal to Σ.

Data generation

Correlation matrix Σ

We consider a correlation matrix having the following block structure:

where Σ11 is the correlation matrix of active variables (non null associated coefficients) with off-diagonal entries equal to a1, Σ22 is the one of non active variables (null associated coefficients) with off-diagonal entries equal to a3 and Σ12 is the correlation matrix between active and non active variables with entries equal to a2. In the following example: (a1, a2, a3) = (0.3, 0.5, 0.7).
The first 10 variables are assumed to be active, among which the first 5 are also predictive.

In the following, p = 50 and n = 50 are used for the example but the approach can handle much larger values of n and p as it is shown in the paper describing PPLasso.

p <- 50 # number of variables 
d <- 10 # number of actives
n <- 50 # number of samples
actives <- 1:d
nonacts <- c(1:p)[-actives]
Sigma <- matrix(0, p, p)
Sigma[actives, actives] <- 0.3
Sigma[-actives, actives] <- 0.5
Sigma[actives, -actives] <- 0.5
Sigma[-actives, -actives] <- 0.7
diag(Sigma) <- rep(1,p)
actives_pred <- 1:5

Generation of X and y

The design matrix is then generated with the correlation matrix Σ previously defined by using the function and the response variable y is generated according to the linear model where the non null components of β1 are equal to 1 and non null components of β2 − β1 are equal to 0.5, α1 = 0 and α2 = 1.

X_bm <- MASS::mvrnorm(n = n, mu=rep(0,p), Sigma, tol = 1e-6, empirical = FALSE)
colnames(X_bm) <- paste0("X",(1:p))
n1=n2=n/2 # 1:1 randomized
beta1 <- rep(0,p)
beta1[actives] <- 1
beta2 <- beta1
beta2[actives_pred] <- 2
beta <- c(beta1, beta2)
TRT1 <- c(rep(1,n1), rep(0,n2))
TRT2 <- c(rep(0,n1), rep(1,n2))
Y <- cbind(X_bm*TRT1,X_bm*TRT2)%*%beta+TRT2+rnorm(n,0,1)

Estimation of Σ

Given y and X, we can estimate the block-wise correlation matrix Σ containing the correlations between the columns of X. We propose to use the function of the R package by keeping the default parameters.

cv_cov_est_out <- cvCovEst(
      dat = X_bm,
      estimators = c(
        linearShrinkLWEst, denseLinearShrinkEst,
        thresholdingEst, poetEst, sampleCovEst
      ),
      estimator_params = list(
        thresholdingEst = list(gamma = c(0.2, 0.4)),
        poetEst = list(lambda = c(0.1, 0.2), k = c(1L, 2L))
      ),
      cv_loss = cvMatrixFrobeniusLoss,
      cv_scheme = "v_fold",
      v_folds = 5
    )
Sigma_est <- cov2cor(cv_cov_est_out$estimate) 

The optimal estimation of Σ can be obtained by the object in the output.

Variable selection

With the previous X and y, the function of the package can be used to select the active variables. If the parameter (correlation matrix) is not provided, it will be automatically estimated by the function of the R package presented in the previous section. However, it can also be provided by the users. Here we use the previously estimated $\widehat{\boldsymbol{\Sigma}}$: .

mod <- ProgPredLasso(X1 = X_bm[1:n1, ], X2 = X_bm[(n1+1):n, ], Y = Y, cor_matrix = Sigma_est)

Additional arguments:

  • : parameter of thresholding appearing in the method described in [1] which is set to 0.95 by default.
  • : integer specifying the maximum number of steps for the generalized Lasso algorithm. Its default value is 500.

Outputs:

  • : all the λ considered.
  • : matrix of the estimations of γ for all the λ considered. Each row of corresponds to γ̂ for a given λ. More precisely, the first (resp. second) column corresponds to the estimation of treatment effect α1 (resp. α2). The 3rd to (p + 2)th columns correspond to the estimation of β1 and the last p columns correspond to the estimation of β2 − β1.
  • : estimation of γ obtained for the λ minimizing the BIC criterion.
  • : BIC criterion for all the λ considered.
  • : MSE (Mean Squared Error) for all the λ considered.

Estimation of γ

The estimation of the treatment effects α1 and α2 are obtained as follows:

#alpha1
 mod$beta.min[1]
## [1] 0.116741
#alpha2 
 mod$beta.min[2]
## [1] 0.910072

The identified prognostic (resp. predictive) biomarkers are displayed on the left (resp. right) of Figure with true prognostic or predictive biomarkers in blue and false positives in red.

\label{fig:fig1}Left: Identified prognostic biomarkers. Right: Identified predictive biomarkers.\label{fig:fig1}Left: Identified prognostic biomarkers. Right: Identified predictive biomarkers.

Left: Identified prognostic biomarkers. Right: Identified predictive biomarkers.

To find the biomarkers identified as prognostic:

which(beta_min[1:p]!=0)
##  [1]  1  2  3  4  5  6  7  8  9 10 12 13 18 20 21 23 28 32 34 36 40 42 43 44 45
## [26] 46 49

and the biomarkers identified as predictive:

which(beta_min[(p+1):(2*p)]!=0)
##  [1]  1  3  4  5  8 21 22 27 30 32 37 39 43

[1] W. Zhu, C. Lévy-Leduc, N. Ternès. Identification of prognostic and predictive biomarkers in high-dimensional data with PPLasso, 2022, Arxiv: 2202.01970.