Introduction

This package provides functions for implementing the PPLasso (Prognostic Predictive Lasso) approach described in [1] to identify prognostic and predictive biomarkers in high dimensional settings. This method is designed by taking into account the correlations that may exist between the biomarkers. It consists in rewriting the initial high-dimensional linear model to remove the correlation existing between the predictors and in applying the generalized Lasso criterion. We refer the reader to the paper for further details.

We suppose that the response variable y satisfy the following linear model: where $X = [\begin{matrix} 1 & 0 & X_{11}^{1} & X_{11}^{2} & \dots & X_{11}^{p} & 0 & 0 & \dots & 0 \\ 1 & 0 & X_{12}^{1} & X_{12}^{2} & \dots & X_{12}^{p} & 0 & 0 & \dots & 0 \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ 1 & 0 & X_{1 n_{1}}^{1} & X_{1 n_{1}}^{2} & \dots & X_{1 n_{1}}^{p} & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & 0 & \dots & 0 & X_{21}^{1} & X_{21}^{2} & \dots & X_{21}^{p} \\ 0 & 1 & 0 & 0 & \dots & 0 & X_{22}^{1} & X_{22}^{2} & \dots & X_{22}^{p} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ 0 & 1 & 0 & 0 & \dots & 0 & X_{2 n_{2}}^{1} & X_{2 n_{2}}^{2} & \dots & X_{2 n_{2}}^{p} \end{matrix}],$ with γ = (α₁, α₂, β₁′, β₂′)′. α₁ (resp. α₂) corresponding to the effects of treatment t₁ (resp. t₂). Moreover, β₁ = (β₁₁, β₁₂, …, β_1p)′ (resp. β₂ = (β₂₁, β₂₂, …, β_2p)′) are the coefficients associated to each of the p biomarkers in treatment t₁ (resp. t₂) group. When t₁ stands for the standard treatment (placebo), prognostic (resp. predictive) biomarkers are defined as those having non-zero coefficients in β₁ (resp. in β₁ − β₂) and non prognostic (resp. non predictive) biomarkers correspond to the indices having null coefficients in β₁ (resp. in β₁ − β₂). The vector β₁ and β₂ − β₁ are assumed to be sparse, a majority of its components is equal to zero. The goal of the PPLasso approach is to retrieve the indices of the nonzero components of β₁ and β₂ − β₁.

Concerning the biomarkers, are the design matrices for t₁ and t₂ groups, respectively. The rows of X₁ and X₂ are assumed to be the realizations of independent centered Gaussian random vectors having a covariance matrix equal to Σ.

Data generation

Correlation matrix Σ

We consider a correlation matrix having the following block structure:

where Σ₁₁ is the correlation matrix of active variables (non null associated coefficients) with off-diagonal entries equal to a₁, Σ₂₂ is the one of non active variables (null associated coefficients) with off-diagonal entries equal to a₃ and Σ₁₂ is the correlation matrix between active and non active variables with entries equal to a₂. In the following example: (a₁, a₂, a₃) = (0.3, 0.5, 0.7).
The first 10 variables are assumed to be active, among which the first 5 are also predictive.

In the following, p = 50 and n = 50 are used for the example but the approach can handle much larger values of n and p as it is shown in the paper describing PPLasso.

p <- 50 # number of variables 
d <- 10 # number of actives
n <- 50 # number of samples
actives <- 1:d
nonacts <- c(1:p)[-actives]
Sigma <- matrix(0, p, p)
Sigma[actives, actives] <- 0.3
Sigma[-actives, actives] <- 0.5
Sigma[actives, -actives] <- 0.5
Sigma[-actives, -actives] <- 0.7
diag(Sigma) <- rep(1,p)
actives_pred <- 1:5

Generation of X and y

The design matrix is then generated with the correlation matrix Σ previously defined by using the function and the response variable y is generated according to the linear model where the non null components of β₁ are equal to 1 and non null components of β₂ − β₁ are equal to 0.5, α₁ = 0 and α₂ = 1.

X_bm <- MASS::mvrnorm(n = n, mu=rep(0,p), Sigma, tol = 1e-6, empirical = FALSE)
colnames(X_bm) <- paste0("X",(1:p))
n1=n2=n/2 # 1:1 randomized
beta1 <- rep(0,p)
beta1[actives] <- 1
beta2 <- beta1
beta2[actives_pred] <- 2
beta <- c(beta1, beta2)
TRT1 <- c(rep(1,n1), rep(0,n2))
TRT2 <- c(rep(0,n1), rep(1,n2))
Y <- cbind(X_bm*TRT1,X_bm*TRT2)%*%beta+TRT2+rnorm(n,0,1)

Estimation of Σ

Given y and X, we can estimate the block-wise correlation matrix Σ containing the correlations between the columns of X. We propose to use the function of the R package by keeping the default parameters.

cv_cov_est_out <- cvCovEst(
      dat = X_bm,
      estimators = c(
        linearShrinkLWEst, denseLinearShrinkEst,
        thresholdingEst, poetEst, sampleCovEst
      ),
      estimator_params = list(
        thresholdingEst = list(gamma = c(0.2, 0.4)),
        poetEst = list(lambda = c(0.1, 0.2), k = c(1L, 2L))
      ),
      cv_loss = cvMatrixFrobeniusLoss,
      cv_scheme = "v_fold",
      v_folds = 5
    )
Sigma_est <- cov2cor(cv_cov_est_out$estimate)

The optimal estimation of Σ can be obtained by the object in the output.

Variable selection

With the previous X and y, the function of the package can be used to select the active variables. If the parameter (correlation matrix) is not provided, it will be automatically estimated by the function of the R package presented in the previous section. However, it can also be provided by the users. Here we use the previously estimated $\widehat{\boldsymbol{\Sigma}}$: .

mod <- ProgPredLasso(X1 = X_bm[1:n1, ], X2 = X_bm[(n1+1):n, ], Y = Y, cor_matrix = Sigma_est)

Additional arguments:

: parameter of thresholding appearing in the method described in [1] which is set to 0.95 by default.
: integer specifying the maximum number of steps for the generalized Lasso algorithm. Its default value is 500.

Outputs:

: all the λ considered.
: matrix of the estimations of γ for all the λ considered. Each row of corresponds to γ̂ for a given λ. More precisely, the first (resp. second) column corresponds to the estimation of treatment effect α₁ (resp. α2). The 3rd to (p + 2)th columns correspond to the estimation of β₁ and the last p columns correspond to the estimation of β₂ − β₁.
: estimation of γ obtained for the λ minimizing the BIC criterion.
: BIC criterion for all the λ considered.
: MSE (Mean Squared Error) for all the λ considered.

Estimation of γ

The estimation of the treatment effects α₁ and α₂ are obtained as follows:

#alpha1
 mod$beta.min[1]

## [1] 0.116741

#alpha2 
 mod$beta.min[2]

## [1] 0.910072

The identified prognostic (resp. predictive) biomarkers are displayed on the left (resp. right) of Figure with true prognostic or predictive biomarkers in blue and false positives in red.

$\label{fig:fig1}Left: Identified prognostic biomarkers. Right: Identified predictive biomarkers.$ $\label{fig:fig1}Left: Identified prognostic biomarkers. Right: Identified predictive biomarkers.$

Left: Identified prognostic biomarkers. Right: Identified predictive biomarkers.

To find the biomarkers identified as prognostic:

which(beta_min[1:p]!=0)

##  [1]  1  2  3  4  5  6  7  8  9 10 12 13 18 20 21 23 28 32 34 36 40 42 43 44 45
## [26] 46 49

and the biomarkers identified as predictive:

which(beta_min[(p+1):(2*p)]!=0)

##  [1]  1  3  4  5  8 21 22 27 30 32 37 39 43

[1] W. Zhu, C. Lévy-Leduc, N. Ternès. Identification of prognostic and predictive biomarkers in high-dimensional data with PPLasso, 2022, Arxiv: 2202.01970.