This package provides functions for implementing the PPLasso (Prognostic Predictive Lasso) approach described in [1] to identify prognostic and predictive biomarkers in high dimensional settings. This method is designed by taking into account the correlations that may exist between the biomarkers. It consists in rewriting the initial high-dimensional linear model to remove the correlation existing between the predictors and in applying the generalized Lasso criterion. We refer the reader to the paper for further details.
We suppose that the response variable y satisfy the following linear model: where $$ \boldsymbol{X}=\begin{bmatrix} 1 & 0 & X_{11}^{1} & X_{11}^{2} & \ldots & X_{11}^{p} & 0 & 0 & \ldots & 0 \\ 1 & 0 & X_{12}^{1} & X_{12}^{2} & \ldots & X_{12}^{p} & 0 & 0 & \ldots & 0 \\ \vdots & \vdots & \vdots& \vdots& & \vdots & & & \\ 1 & 0 & X_{1n_{1}}^{1} & X_{1n_{1}}^{2} & \ldots & X_{1n_{1}}^{p} & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & 0 & \ldots & 0 & X_{21}^{1} & X_{21}^{2} & \ldots & X_{21}^{p} \\ 0 & 1 & 0 & 0 & \ldots & 0 & X_{22}^{1} & X_{22}^{2} & \ldots & X_{22}^{p} \\ \vdots & \vdots & \vdots& \vdots& & \vdots& \vdots&\vdots & &\vdots\\ 0 & 1 & 0 & 0 & \ldots & 0 & X_{2n_{2}}^{1} & X_{2n_{2}}^{2} & \ldots & X_{2n_{2}}^{p} \end{bmatrix}, $$ with γ = (α1, α2, β1′, β2′)′. α1 (resp. α2) corresponding to the effects of treatment t1 (resp. t2). Moreover, β1 = (β11, β12, …, β1p)′ (resp. β2 = (β21, β22, …, β2p)′) are the coefficients associated to each of the p biomarkers in treatment t1 (resp. t2) group. When t1 stands for the standard treatment (placebo), prognostic (resp. predictive) biomarkers are defined as those having non-zero coefficients in β1 (resp. in β1 − β2) and non prognostic (resp. non predictive) biomarkers correspond to the indices having null coefficients in β1 (resp. in β1 − β2). The vector β1 and β2 − β1 are assumed to be sparse, a majority of its components is equal to zero. The goal of the PPLasso approach is to retrieve the indices of the nonzero components of β1 and β2 − β1.
Concerning the biomarkers, are the design matrices for t1 and t2 groups, respectively. The rows of X1 and X2 are assumed to be the realizations of independent centered Gaussian random vectors having a covariance matrix equal to Σ.
We consider a correlation matrix having the following block structure:
where Σ11 is the
correlation matrix of active variables (non null associated
coefficients) with off-diagonal entries equal to a1, Σ22 is the one of
non active variables (null associated coefficients) with off-diagonal
entries equal to a3
and Σ12 is
the correlation matrix between active and non active variables with
entries equal to a2. In the following
example: (a1, a2, a3) = (0.3, 0.5, 0.7).
The first 10 variables are assumed to be active, among which the first 5
are also predictive.
In the following, p = 50 and n = 50 are used for the example but the approach can handle much larger values of n and p as it is shown in the paper describing PPLasso.
p <- 50 # number of variables
d <- 10 # number of actives
n <- 50 # number of samples
actives <- 1:d
nonacts <- c(1:p)[-actives]
Sigma <- matrix(0, p, p)
Sigma[actives, actives] <- 0.3
Sigma[-actives, actives] <- 0.5
Sigma[actives, -actives] <- 0.5
Sigma[-actives, -actives] <- 0.7
diag(Sigma) <- rep(1,p)
actives_pred <- 1:5
The design matrix is then generated with the correlation matrix Σ previously defined by using the function and the response variable y is generated according to the linear model where the non null components of β1 are equal to 1 and non null components of β2 − β1 are equal to 0.5, α1 = 0 and α2 = 1.
X_bm <- MASS::mvrnorm(n = n, mu=rep(0,p), Sigma, tol = 1e-6, empirical = FALSE)
colnames(X_bm) <- paste0("X",(1:p))
n1=n2=n/2 # 1:1 randomized
beta1 <- rep(0,p)
beta1[actives] <- 1
beta2 <- beta1
beta2[actives_pred] <- 2
beta <- c(beta1, beta2)
TRT1 <- c(rep(1,n1), rep(0,n2))
TRT2 <- c(rep(0,n1), rep(1,n2))
Y <- cbind(X_bm*TRT1,X_bm*TRT2)%*%beta+TRT2+rnorm(n,0,1)
Given y and X, we can estimate the block-wise correlation matrix Σ containing the correlations between the columns of X. We propose to use the function of the R package by keeping the default parameters.
cv_cov_est_out <- cvCovEst(
dat = X_bm,
estimators = c(
linearShrinkLWEst, denseLinearShrinkEst,
thresholdingEst, poetEst, sampleCovEst
),
estimator_params = list(
thresholdingEst = list(gamma = c(0.2, 0.4)),
poetEst = list(lambda = c(0.1, 0.2), k = c(1L, 2L))
),
cv_loss = cvMatrixFrobeniusLoss,
cv_scheme = "v_fold",
v_folds = 5
)
Sigma_est <- cov2cor(cv_cov_est_out$estimate)
The optimal estimation of Σ can be obtained by the object in the output.
With the previous X and y, the function of the package can be used to select the active variables. If the parameter (correlation matrix) is not provided, it will be automatically estimated by the function of the R package presented in the previous section. However, it can also be provided by the users. Here we use the previously estimated $\widehat{\boldsymbol{\Sigma}}$: .
Additional arguments:
Outputs:
The estimation of the treatment effects α1 and α2 are obtained as follows:
## [1] 0.116741
## [1] 0.910072
The identified prognostic (resp. predictive) biomarkers are displayed on the left (resp. right) of Figure with true prognostic or predictive biomarkers in blue and false positives in red.
Left: Identified prognostic biomarkers. Right: Identified predictive biomarkers.
To find the biomarkers identified as prognostic:
## [1] 1 2 3 4 5 6 7 8 9 10 12 13 18 20 21 23 28 32 34 36 40 42 43 44 45
## [26] 46 49
and the biomarkers identified as predictive:
## [1] 1 3 4 5 8 21 22 27 30 32 37 39 43
[1] W. Zhu, C. Lévy-Leduc, N. Ternès. Identification of prognostic and predictive biomarkers in high-dimensional data with PPLasso, 2022, Arxiv: 2202.01970.