Title: | Fused Lasso Approach in Regression Coefficient Clustering |
---|---|
Description: | Fused lasso method to cluster and estimate regression coefficients of the same covariate across different data sets when a large number of independent data sets are combined. Package supports Gaussian, binomial, Poisson and Cox PH models. |
Authors: | Lu Tang, Ling Zhou, Peter X.K. Song |
Maintainer: | Lu Tang <[email protected]> |
License: | GPL-2 |
Version: | 2.0-1 |
Built: | 2024-10-29 06:29:33 UTC |
Source: | CRAN |
Fused lasso method to cluster and estimate regression coefficients of the same covariate across different data sets when a large number of independent data sets are combined. Package supports Gaussian, binomial, Poisson and Cox PH models.
Simple to use. Accepts X, y, and sid (numeric data source ID indicating which dataset each entry belongs to) for regression models. Returns regression coefficient estimates and clustering patterns of coefficients across different datasets, for each covariate. Provides visualization by fusogram, a dendrogram-type presentation of the coefficient clustering pattern across data sources.
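To make the expected input layout concrete, the following sketch stacks three separate studies into a single X, y and sid. The study sizes, the column names x1 and x2, and the use of as.matrix are assumptions made for illustration; only the roles of X, y and sid come from the description above.

```r
# Three hypothetical studies, each stored as its own data frame.
set.seed(1)
studies <- lapply(c(30, 40, 50), function(n_k) {
  data.frame(y = rnorm(n_k), x1 = rnorm(n_k), x2 = rnorm(n_k))
})

# Stack the studies and record which study each row came from.
stacked <- do.call(rbind, studies)
sid <- rep(seq_along(studies), times = vapply(studies, nrow, integer(1)))

# Response vector and predictor matrix (no intercept column).
y <- stacked$y
X <- as.matrix(stacked[, c("x1", "x2")])

# Every row of X needs exactly one response and one source ID.
stopifnot(nrow(X) == length(y), length(y) == length(sid))
table(sid)
```

The point of the layout is that all datasets share one set of columns, and sid alone tells the method which rows belong together.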
Lu Tang, and Peter X.K. Song. Fused Lasso Approach in Regression Coefficients Clustering - Learning Parameter Heterogeneity in Data Integration. Journal of Machine Learning Research, 17(113):1-23, 2016.
Fei Wang, Lu Wang, and Peter X.K. Song. Fused lasso with the adaptation of parameter ordering in combining multiple studies with repeated measurements. Biometrics, DOI:10.1111/biom.12496, 2016.
library(metafuse)

########### generate data ###########
n <- 200 # sample size in each dataset (can also be a K-element vector)
K <- 10  # number of datasets for data integration
p <- 3   # number of covariates in X (including the intercept)

# the coefficient matrix of dimension K * p, used to specify the heterogeneous pattern
beta0 <- matrix(c(0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,  # beta_0 of intercept
                  0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,  # beta_1 of X_1
                  0.0,0.0,0.0,0.0,0.5,0.5,0.5,1.0,1.0,1.0), # beta_2 of X_2
                K, p)

# generate a data set, family=c("gaussian", "binomial", "poisson", "cox")
data <- datagenerator(n=n, beta0=beta0, family="gaussian", seed=123)

# prepare the input for metafuse
y <- data$y
sid <- data$group
X <- data[,-c(1,ncol(data))]

########### run metafuse ###########
# fuse slopes of X1 (which is heterogeneous with 2 clusters)
metafuse(X=X, y=y, sid=sid, fuse.which=c(1), family="gaussian", intercept=TRUE, alpha=0,
         criterion="EBIC", verbose=TRUE, plots=TRUE, loglambda=TRUE)

# fuse slopes of X2 (which is heterogeneous with 3 clusters)
metafuse(X=X, y=y, sid=sid, fuse.which=c(2), family="gaussian", intercept=TRUE, alpha=0,
         criterion="EBIC", verbose=TRUE, plots=TRUE, loglambda=TRUE)

# fuse all three covariates
metafuse(X=X, y=y, sid=sid, fuse.which=c(0,1,2), family="gaussian", intercept=TRUE, alpha=0,
         criterion="EBIC", verbose=TRUE, plots=TRUE, loglambda=TRUE)

# fuse all three covariates, with sparsity penalty
metafuse(X=X, y=y, sid=sid, fuse.which=c(0,1,2), family="gaussian", intercept=TRUE, alpha=1,
         criterion="EBIC", verbose=TRUE, plots=TRUE, loglambda=TRUE)

# fit metafuse at a given lambda
metafuse.l(X=X, y=y, sid=sid, fuse.which=c(0,1,2), family="gaussian", intercept=TRUE,
           alpha=1, lambda=0.5)
Simulate a dataset with data from K different sources, for demonstration of metafuse.
datagenerator(n, beta0, family, seed = NA)
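n | a vector of length K giving the sample size of each dataset, or a scalar when all K datasets have the same sample size |
beta0 | a coefficient matrix of dimension K * p, where K is the number of datasets and p is the number of covariates (including the intercept) |
family | the type of the response vector, one of "gaussian", "binomial", "poisson" or "cox" |
seed | the random seed for data generation, default is NA |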
These datasets are artificial, and are used to demonstrate the features of metafuse. When family="cox", the response will contain two vectors, a time-to-event variable time and a censoring indicator status.
Returns a data frame with n*K rows (if n is a scalar), or sum(n) rows (if n is a K-element vector). The data frame contains columns "y", "x1", ..., "x_p-1" and "group" if family is "gaussian", "binomial" or "poisson"; it contains columns "time", "status", "x1", ..., "x_p-1" and "group" if family="cox".
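The shape of the returned data frame can be illustrated with a base R stand-in for the Gaussian case. This is a sketch only, not the package's datagenerator: the function name simulate_sources, the standard normal covariates and the unit-variance noise are all assumptions. Row k of beta0 holds the coefficients of dataset k, as described above.

```r
# Hypothetical stand-in for the Gaussian case; not the package implementation.
simulate_sources <- function(n, beta0, seed = NA) {
  if (!is.na(seed)) set.seed(seed)
  K <- nrow(beta0)                      # number of datasets
  p <- ncol(beta0)                      # covariates, including the intercept
  if (length(n) == 1) n <- rep(n, K)    # a scalar n means equal sample sizes
  stopifnot(length(n) == K, p >= 2)

  pieces <- lapply(seq_len(K), function(k) {
    # Assumed design: independent standard normal covariates.
    x <- matrix(rnorm(n[k] * (p - 1)), n[k], p - 1)
    colnames(x) <- paste0("x", seq_len(p - 1))
    # Linear predictor built from the k-th row of beta0.
    eta <- drop(cbind(1, x) %*% beta0[k, ])
    data.frame(y = eta + rnorm(n[k]), x, group = k)
  })
  do.call(rbind, pieces)
}

beta0 <- matrix(c(rep(0, 10),                 # intercept
                  rep(c(0, 1), each = 5),     # x1: two clusters
                  rep(c(0, 0.5, 1), c(4, 3, 3))), # x2: three clusters
                10, 3)
sim <- simulate_sources(n = 200, beta0 = beta0, seed = 123)
names(sim)
nrow(sim)
```

With a scalar n = 200 and K = 10, the result has n*K = 2000 rows and the columns "y", "x1", "x2" and "group".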
library(metafuse)

########### generate data ###########
n <- 200 # sample size in each dataset (can also be a K-element vector)
K <- 10  # number of datasets for data integration
p <- 3   # number of covariates in X (including the intercept)

# the coefficient matrix of dimension K * p, used to specify the heterogeneous pattern
beta0 <- matrix(c(0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,  # beta_0 of intercept
                  0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,  # beta_1 of X_1
                  0.0,0.0,0.0,0.0,0.5,0.5,0.5,1.0,1.0,1.0), # beta_2 of X_2
                K, p)

# generate a data set, family=c("gaussian", "binomial", "poisson", "cox")
data <- datagenerator(n=n, beta0=beta0, family="gaussian", seed=123)
names(data)

# if family="cox", returned dataset contains columns "time" and "status" instead of "y"
data <- datagenerator(n=n, beta0=beta0, family="cox", seed=123)
names(data)
Fit a GLM with a fusion penalty on the coefficients of each covariate across datasets, and generate the solution path and fusograms to visualize model selection.
metafuse(X = X, y = y, sid = sid, fuse.which = c(0:ncol(X)), family = "gaussian", intercept = TRUE, alpha = 0, criterion = "EBIC", verbose = TRUE, plots = FALSE, loglambda = TRUE)
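X | a matrix (or vector) of predictor(s), with dimensions of N * (p-1), where N is the total sample size over all datasets and p counts the covariates including the intercept |
y | a vector of response, with length N |
sid | data source ID of length N |
fuse.which | a vector of integers from 0 to p-1 (that is, 0:ncol(X)), indicating which covariates are considered for fusion; 0 refers to the intercept |
family | response vector type, one of "gaussian", "binomial", "poisson" or "cox" |
intercept | if TRUE, an intercept is included in the model; default is TRUE |
alpha | the ratio of sparsity penalty to fusion penalty, default is 0 (i.e., no variable selection, only fusion) |
criterion | the model selection criterion, for example "EBIC" as in the usage above |
verbose | if TRUE, progress information is printed; default is TRUE |
plots | if TRUE, the solution paths and fusograms are plotted; default is FALSE |
loglambda | if TRUE, lambda is shown on a log scale; default is TRUE |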
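The indexing used by fuse.which can be checked with a few lines of base R. This sketch assumes, following the examples, that 0 refers to the intercept and j refers to the j-th column of X; the label "(Intercept)" and the toy matrix are illustrative choices.

```r
# Toy predictor matrix with two named columns.
X <- matrix(0, nrow = 4, ncol = 2, dimnames = list(NULL, c("x1", "x2")))

# Position 0 is the intercept, positions 1..ncol(X) are the columns of X.
terms <- c("(Intercept)", colnames(X))

fuse.which <- c(0, 2)
fused  <- terms[fuse.which + 1]   # coefficients allowed to differ across datasets
common <- setdiff(terms, fused)   # coefficients treated as homogeneous

fused
common
```

Here the intercept and x2 are candidates for clustering across datasets, while x1 is treated as homogeneous.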
An adaptive lasso penalty is used. See Zou (2006) for details.
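As a rough illustration of how an adaptive fusion penalty behaves, the sketch below evaluates one for the K coefficients of a single covariate. It assumes a form in the spirit of Tang and Song (2016): coefficients are ordered by their initial estimates, adjacent differences are penalized, and weights are inverse powers of the initial differences. The function name fusion_penalty, the exponent r and the exact weighting are assumptions, not the package's internal code.

```r
# Hypothetical helper: penalty for the K coefficients of one covariate.
fusion_penalty <- function(beta, beta_init, lambda, alpha = 0, r = 1) {
  ord <- order(beta_init)                       # ordering from initial estimates
  w_fuse <- 1 / abs(diff(beta_init[ord]))^r     # adaptive weights for fusion
  w_sparse <- 1 / abs(beta_init)^r              # adaptive weights for sparsity
  fuse <- sum(w_fuse * abs(diff(beta[ord])))    # adjacent-difference penalty
  sparse <- sum(w_sparse * abs(beta))           # lasso-type penalty
  lambda * (fuse + alpha * sparse)              # alpha = ratio of sparsity to fusion
}

# Initial estimates must be distinct and nonzero for the weights to be finite.
beta_init  <- c(0.1, 0.9, 0.2, 1.1)
beta_fused <- c(0.15, 1.0, 0.15, 1.0)  # datasets {1, 3} and {2, 4} fused

pen0 <- fusion_penalty(beta_fused, beta_init, lambda = 1)             # fusion only
pen1 <- fusion_penalty(beta_fused, beta_init, lambda = 1, alpha = 1)  # plus sparsity
c(pen0, pen1)
```

Pairs that were close in the initial fit receive large weights, so they are fused first; once two adjacent coefficients are equal, their term contributes nothing to the penalty.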
A list containing the following items will be returned:
family | the response/model type |
criterion | model selection criterion used |
alpha | the ratio of sparsity penalty to fusion penalty |
if.fuse | whether covariate is assumed to be heterogeneous (1) or homogeneous (0) |
betahat | the estimated regression coefficients |
betainfo | additional information about the fit, including degrees of freedom, optimal lambda value, maximum lambda value to fuse all coefficients, and estimated friction of fusion |
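Once coefficients have been estimated, the clustering pattern of one covariate can be read off by grouping datasets whose estimates coincide. The sketch below does this in base R for a made-up vector of estimates; the vector betahat_x1, the helper cluster_labels and the rounding tolerance are assumptions, and the actual layout of betahat in the returned list is not specified here.

```r
# Made-up estimates of one slope across K = 10 datasets.
betahat_x1 <- c(0.02, 0.02, 0.02, 0.02, 0.02, 0.98, 0.98, 0.98, 0.98, 0.98)

# Hypothetical helper: datasets with equal estimates share a cluster label.
cluster_labels <- function(b, digits = 6) {
  b <- round(b, digits)       # guard against floating-point noise
  match(b, sort(unique(b)))   # label clusters in increasing order of the estimate
}

cl <- cluster_labels(betahat_x1)
cl
tapply(betahat_x1, cl, mean)  # one fused estimate per cluster
```

For these values the first five datasets form one cluster and the last five form another.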
library(metafuse)

########### generate data ###########
n <- 200 # sample size in each dataset (can also be a K-element vector)
K <- 10  # number of datasets for data integration
p <- 3   # number of covariates in X (including the intercept)

# the coefficient matrix of dimension K * p, used to specify the heterogeneous pattern
beta0 <- matrix(c(0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,  # beta_0 of intercept
                  0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,  # beta_1 of X_1
                  0.0,0.0,0.0,0.0,0.5,0.5,0.5,1.0,1.0,1.0), # beta_2 of X_2
                K, p)

# generate a data set, family=c("gaussian", "binomial", "poisson", "cox")
data <- datagenerator(n=n, beta0=beta0, family="gaussian", seed=123)

# prepare the input for metafuse
y <- data$y
sid <- data$group
X <- data[,-c(1,ncol(data))]

########### run metafuse ###########
# fuse slopes of X1 (which is heterogeneous with 2 clusters)
metafuse(X=X, y=y, sid=sid, fuse.which=c(1), family="gaussian", intercept=TRUE, alpha=0,
         criterion="EBIC", verbose=TRUE, plots=TRUE, loglambda=TRUE)

# fuse slopes of X2 (which is heterogeneous with 3 clusters)
metafuse(X=X, y=y, sid=sid, fuse.which=c(2), family="gaussian", intercept=TRUE, alpha=0,
         criterion="EBIC", verbose=TRUE, plots=TRUE, loglambda=TRUE)

# fuse all three covariates
metafuse(X=X, y=y, sid=sid, fuse.which=c(0,1,2), family="gaussian", intercept=TRUE, alpha=0,
         criterion="EBIC", verbose=TRUE, plots=TRUE, loglambda=TRUE)

# fuse all three covariates, with sparsity penalty
metafuse(X=X, y=y, sid=sid, fuse.which=c(0,1,2), family="gaussian", intercept=TRUE, alpha=1,
         criterion="EBIC", verbose=TRUE, plots=TRUE, loglambda=TRUE)
Fit a GLM with a fusion penalty on the coefficients of each covariate at a given lambda.
metafuse.l(X = X, y = y, sid = sid, fuse.which = c(0:ncol(X)), family = "gaussian", intercept = TRUE, alpha = 0, lambda = lambda)
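X | a matrix (or vector) of predictor(s), with dimensions of N * (p-1), where N is the total sample size over all datasets and p counts the covariates including the intercept |
y | a vector of response, with length N |
sid | data source ID of length N |
fuse.which | a vector of integers from 0 to p-1 (that is, 0:ncol(X)), indicating which covariates are considered for fusion; 0 refers to the intercept |
family | response vector type, one of "gaussian", "binomial", "poisson" or "cox" |
intercept | if TRUE, an intercept is included in the model; default is TRUE |
alpha | the ratio of sparsity penalty to fusion penalty, default is 0 (i.e., no variable selection, only fusion) |
lambda | tuning parameter for fusion penalty |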
An adaptive lasso penalty is used. See Zou (2006) for details.
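The role of lambda is easiest to see in a toy problem with two datasets and an intercept-only Gaussian model. The sketch below uses an unweighted fusion penalty, which is a simplification of the adaptive penalty, and the helper names soft and fuse_two are made up. With sample means m1 and m2 from n observations each, minimizing (n/2)(b1 - m1)^2 + (n/2)(b2 - m2)^2 + lambda * |b1 - b2| keeps the average of the two estimates fixed and soft-thresholds their gap at 2 * lambda / n.

```r
# Soft-thresholding operator.
soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# Closed-form solution of the two-dataset toy problem described above.
fuse_two <- function(m1, m2, n, lambda) {
  centre <- (m1 + m2) / 2                    # the average is not penalized
  gap <- soft(m2 - m1, 2 * lambda / n)       # the gap shrinks as lambda grows
  c(b1 = centre - gap / 2, b2 = centre + gap / 2)
}

fuse_two(m1 = 0, m2 = 1, n = 10, lambda = 2)  # gap shrinks from 1 to 0.6
fuse_two(m1 = 0, m2 = 1, n = 10, lambda = 5)  # gap reaches 0: fully fused
```

In this toy setting the two coefficients are fused as soon as lambda reaches n * |m2 - m1| / 2, which is the analogue of the maximum lambda value needed to fuse all coefficients.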
A list containing the following items will be returned:
family | the response/model type |
alpha | the ratio of sparsity penalty to fusion penalty |
if.fuse | whether covariate is assumed to be heterogeneous (1) or homogeneous (0) |
betahat | the estimated regression coefficients |
betainfo | additional information about the fit, including degrees of freedom, optimal lambda value, maximum lambda value to fuse all coefficients, and estimated friction of fusion |
library(metafuse)

########### generate data ###########
n <- 200 # sample size in each dataset (can also be a K-element vector)
K <- 10  # number of datasets for data integration
p <- 3   # number of covariates in X (including the intercept)

# the coefficient matrix of dimension K * p, used to specify the heterogeneous pattern
beta0 <- matrix(c(0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,  # beta_0 of intercept
                  0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,  # beta_1 of X_1
                  0.0,0.0,0.0,0.0,0.5,0.5,0.5,1.0,1.0,1.0), # beta_2 of X_2
                K, p)

# generate a data set, family=c("gaussian", "binomial", "poisson", "cox")
data <- datagenerator(n=n, beta0=beta0, family="gaussian", seed=123)

# prepare the input for metafuse
y <- data$y
sid <- data$group
X <- data[,-c(1,ncol(data))]

########### run metafuse ###########
# fit metafuse at a given lambda
metafuse.l(X=X, y=y, sid=sid, fuse.which=c(0,1,2), family="gaussian", intercept=TRUE,
           alpha=1, lambda=0.5)