Title: | Privacy-Preserving Distributed Algorithms |
---|---|
Description: | A collection of privacy-preserving distributed algorithms for conducting multi-site data analyses. The regression analyses can be linear regression for a continuous outcome, logistic regression for a binary outcome, Cox proportional hazards regression for a time-to-event outcome, Poisson regression for a count outcome, or multi-categorical regression for a nominal or ordinal outcome. The PDA algorithm runs on a lead site and only requires summary statistics from collaborating sites, with one or a few iterations. The package can be used together with the online system (<https://pda-ota.pdamethods.org/>) for safe and convenient collaboration. For more information, please visit our software websites: <https://github.com/Penncil/pda> and <https://pdamethods.org/>. |
Authors: | Chongliang Luo [aut], Rui Duan [aut], Mackenzie Edmondson [aut], Jiayi Tong [aut], Xiaokang Liu [aut], Kenneth Locke [aut], Jiajie Chen [cre], Yong Chen [aut], Penn Computing Inference Learning (PennCIL) lab [cph] |
Maintainer: | Jiajie Chen <[email protected]> |
License: | Apache License 2.0 |
Version: | 1.2.7 |
Built: | 2024-10-31 06:59:33 UTC |
Source: | CRAN |
A simulated data set for ADAP demonstration
ADAP_data
A list containing the following elements:
site id, 300 'site1', 300 'site2', 300 'site3'
binary outcome of length 900
900 by 49 matrix generated by standard normal distribution, representing the covariates
A simulated data set of hospitalization Length of Stay (LOS) and mortality from 6 sites
covid
A data frame with 2100 rows and 6 variables:
site id, 600 'site1', 500 'site2', 400 'site3', 300 'site4', 200 'site5', 100 'site6'
continuous age in year, min 3 max 97
2 categories, '1' for male and '0' for female
lab test results, continuous value ranging from 2.3 to 97.4
LOS in days, ranging from 1 to 29
mortality status, '1' for death and '0' for alive.
A data set modified from the CrabSatellites data in countreg package (see demo(ODAH)).
cs
A data frame containing 173 observations on 4 variables.
Simulated site id, 85 'site1' and 88 'site2'.
Number of satellites. Treated as (zero-inflated) count outcome in ODAH
Carapace width (cm).
Weight (kg).
https://rdrr.io/rforge/countreg/man/CrabSatellites.html
Gather cloud settings into a list
getCloudConfig(site_id,dir,uri,secret)
site_id |
site identifier |
dir |
shared directory path if flat files |
uri |
web uri if web service |
secret |
web token if web service |
A list of cloud parameters: site_id, secret and uri
pda
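The idea of collecting the site identifier, shared directory, URI and token into one list can be sketched in base R. This is a hypothetical helper (`gather_cloud_config_sketch` is not part of the package and is not its implementation); it assumes that unset `uri` and `secret` arguments fall back to the `PDA_URI` and `PDA_SECRET` environment variables mentioned in the `pda` examples.

```r
# Hypothetical sketch, NOT the package's getCloudConfig():
# collect cloud settings into a list, falling back to environment
# variables for the web-service fields when they are not supplied.
gather_cloud_config_sketch <- function(site_id, dir = NULL, uri = NULL, secret = NULL) {
  from_env <- function(arg, env_name) {
    if (!is.null(arg)) return(arg)
    val <- Sys.getenv(env_name, unset = NA)
    if (is.na(val) || !nzchar(val)) NULL else val
  }
  config <- list(site_id = site_id)
  config$dir    <- dir                              # shared directory (flat files)
  config$uri    <- from_env(uri, "PDA_URI")         # web URI (web service)
  config$secret <- from_env(secret, "PDA_SECRET")   # web token (web service)
  config
}

# Flat-file usage: only site_id and dir are set explicitly
cfg_sketch <- gather_cloud_config_sketch("site1", dir = tempdir())
str(cfg_sketch)
```

Assigning `NULL` to a list element leaves it absent, so the returned list only carries the settings that were actually found.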
A simulated data set of hospitalization Length of Stay (LOS) from 3 sites
LOS
A data frame with 1000 rows and 5 variables:
site id, 500 'site1', 400 'site2' and 100 'site3'
3 categories, 'young', 'middle', and 'old'
2 categories, 'M' for male and 'F' for female
lab test results, continuous value ranging from 0 to 100
LOS in days, ranging from 1 to 28. Treated as continuous outcome in DLM
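The reference list attributes DLM to lossless aggregation for linear regression (Chen et al., 2006). A minimal base-R sketch of that idea, not the package's implementation, is given below: each site shares only the summary statistics X'X and X'y, and the lead site recovers the pooled least-squares fit exactly. The data are simulated with the same site sizes as above (500, 400, 100); the column names `x` and `y` and the helper names are hypothetical.

```r
set.seed(2)

# Simulate one site's individual-level data (hypothetical columns x, y)
make_site <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n))
}
sites_dlm <- list(site1 = make_site(500), site2 = make_site(400), site3 = make_site(100))

# Each site computes and shares only aggregate statistics
summarise_site <- function(d) {
  X <- cbind(1, d$x)                       # design matrix with intercept
  list(XtX = crossprod(X), Xty = crossprod(X, d$y))
}
stats <- lapply(sites_dlm, summarise_site)

# The lead site adds the pieces and solves the normal equations
XtX <- Reduce(`+`, lapply(stats, `[[`, "XtX"))
Xty <- Reduce(`+`, lapply(stats, `[[`, "Xty"))
b_dist <- drop(solve(XtX, Xty))

# Pooled fit on the stacked data, for comparison only
b_pool <- unname(coef(lm(y ~ x, data = do.call(rbind, sites_dlm))))
all.equal(b_dist, b_pool)
```

Because the normal equations depend on the data only through these sums, the distributed and pooled coefficients agree up to numerical precision.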
A data set modified from the lung data in survival package (see demo(ODAC)).
lung2
A data frame with 228 rows and 5 variables:
simulated site id, 86 'site1', 83 'site2' and 59 'site3'
survival time in days
censoring status 0=censored, 1=dead
age in years
1 for female and 0 for male
https://CRAN.R-project.org/package=survival
A simulated data set for ODACAT demonstration
ODACAT_nominal
A data frame with 300 rows and 5 variables:
site id, 102 'site1', 100 'site2', 98 'site3'
3-category outcome, possible values are 1,2,3. Category 3 will be used as reference
the first covariate, continuous
the second covariate, binary
the third covariate, binary
A simulated data set for ODACAT demonstration
ODACAT_ordinal
A data frame with 300 rows and 5 variables:
site id, 105 'site1', 105 'site2', 90 'site3'
3-category outcome, possible values are 1,2,3. Category 3 will be used as reference
the first covariate, continuous
the second covariate, binary
the third covariate, binary
Fit Privacy-preserving Distributed Algorithms for linear, logistic, Poisson and Cox PH regression with possibly heterogeneous data across sites.
pda(ipdata,site_id,control,dir,uri,secret,hosdata)
ipdata |
Local individual patient data (IPD) as a data frame; should include at least one column for the outcome and one column for the covariates |
site_id |
Character site name |
control |
pda control data |
dir |
directory for shared flat file cloud |
uri |
Universal Resource Identifier for this run |
secret |
password to authenticate as site_id on uri |
hosdata |
hospital-level data, should include the same name as defined in the control file |
control
control
Michael I. Jordan, Jason D. Lee & Yun Yang (2019) Communication-Efficient Distributed Statistical Inference,
Journal of the American Statistical Association, 114:526, 668-681
doi:10.1080/01621459.2018.1429274.
(DLM) Yixin Chen, et al. (2006) Regression cubes with lossless compression and aggregation.
IEEE Transactions on Knowledge and Data Engineering, 18(12), pp.1585-1599.
(DLMM) Chongliang Luo, et al. (2020) Lossless Distributed Linear Mixed Model with Application to Integration of Heterogeneous Healthcare Data.
medRxiv, doi:10.1101/2020.11.16.20230730.
(DPQL) Chongliang Luo, et al. (2021) dPQL: a lossless distributed algorithm for generalized linear mixed model with application to privacy-preserving hospital profiling.
medRxiv, doi:10.1101/2021.05.03.21256561.
(ODAL) Rui Duan, et al. (2020) Learning from electronic health records across multiple sites:
A communication-efficient and privacy-preserving distributed algorithm.
Journal of the American Medical Informatics Association, 27.3:376–385,
doi:10.1093/jamia/ocz199.
(ODAC) Rui Duan, et al. (2020) Learning from local to global: An efficient distributed algorithm for modeling time-to-event data.
Journal of the American Medical Informatics Association, 27.7:1028–1036,
doi:10.1093/jamia/ocaa044.
(ODACH) Chongliang Luo, et al. (2021) ODACH: A One-shot Distributed Algorithm for Cox model with Heterogeneous Multi-center Data.
medRxiv, doi:10.1101/2021.04.18.21255694.
(ODAH) Mackenzie J. Edmondson, et al. (2021) An Efficient and Accurate Distributed Learning Algorithm for Modeling Multi-Site Zero-Inflated Count Outcomes.
medRxiv, pp.2020-12.
doi:10.1101/2020.12.17.20248194.
(ADAP) Xiaokang Liu, et al. (2021) ADAP: multisite learning with high-dimensional heterogeneous data via A Distributed Algorithm for Penalized regression.
(dGEM) Jiayi Tong, et al. (2022) dGEM: Decentralized Generalized Linear Mixed Effects Model
pdaPut, pdaList, pdaGet, getCloudConfig and pdaSync.
require(survival)
require(data.table)
require(pda)
data(lung)

## In the toy example below we analyze the association of lung status with
## age and sex using logistic regression. We take data(lung) from 'survival' and
## randomly assign it to 3 sites: 'site1', 'site2', 'site3'. We demonstrate that
## PDA ODAL can obtain a surrogate estimator that is close to the pooled estimate.
## We run the example in a local directory. In actual collaboration, an
## account/password for the pda server will be assigned to the sites at the
## server https://pda.one. Each site can check the communication of the summary
## statistics via a web browser.
## For more examples, see demo(ODAC) and demo(ODAP).

## Create 3 sites, split the lung data amongst them
sites = c('site1', 'site2', 'site3')
set.seed(42)
lung2 <- lung[,c('status', 'age', 'sex')]
lung2$sex <- lung2$sex - 1
lung2$status <- ifelse(lung2$status == 2, 1, 0)
lung_split <- split(lung2, sample(1:length(sites), nrow(lung), replace=TRUE))

## fit logistic regression using pooled data
fit.pool <- glm(status ~ age + sex, family = 'binomial', data = lung2)

## ############################ STEP 1: initialize ############################
control <- list(project_name = 'Lung cancer study',
                step = 'initialize',
                sites = sites,
                heterogeneity = FALSE,
                model = 'ODAL',
                family = 'binomial',
                outcome = "status",
                variables = c('age', 'sex'),
                optim_maxit = 100,
                lead_site = 'site1',
                upload_date = as.character(Sys.time()) )

## run the example in a local directory:
## specify your working directory, default is the tempdir
mydir <- tempdir()

## assume lead site1: enter "1" to allow transferring the control file
pda(site_id = 'site1', control = control, dir = mydir)

## in actual collaboration, account/password for pda server will be assigned, thus:
## Not run: pda(site_id = 'site1', control = control, uri = 'https://pda.one', secret='abc123')
## you can also set your environment variables, and no need to specify them in pda:
## Not run: Sys.setenv(PDA_USER = 'site1', PDA_SECRET = 'abc123', PDA_URI = 'https://pda.one')
## Not run: pda(site_id = 'site1', control = control)

## assume remote site3: enter "1" to allow transferring your local estimate
pda(site_id = 'site3', ipdata = lung_split[[3]], dir=mydir)

## assume remote site2: enter "1" to allow transferring your local estimate
pda(site_id = 'site2', ipdata = lung_split[[2]], dir=mydir)

## assume lead site1: enter "1" to allow transferring your local estimate
## control.json is also automatically updated
pda(site_id = 'site1', ipdata = lung_split[[1]], dir=mydir)

## if lead site1 initialized before other sites,
## lead site1: uncomment to sync the control before STEP 2
## Not run: pda(site_id = 'site1', control = control)
## Not run: config <- getCloudConfig(site_id = 'site1')
## Not run: pdaSync(config)

## ############################ STEP 2: derivative ############################
## assume remote site3: enter "1" to allow transferring your derivatives
pda(site_id = 'site3', ipdata = lung_split[[3]], dir=mydir)

## assume remote site2: enter "1" to allow transferring your derivatives
pda(site_id = 'site2', ipdata = lung_split[[2]], dir=mydir)

## assume lead site1: enter "1" to allow transferring your derivatives
pda(site_id = 'site1', ipdata = lung_split[[1]], dir=mydir)

## ############################ STEP 3: estimate ############################
## assume lead site1: enter "1" to allow transferring the surrogate estimate
pda(site_id = 'site1', ipdata = lung_split[[1]], dir=mydir)

## the PDA ODAL is now completed!
## All the sites can still run their own surrogate estimates and broadcast them.

## compare the surrogate estimate with the pooled estimate
config <- getCloudConfig(site_id = 'site1', dir=mydir)
fit.odal <- pdaGet(name = 'site1_estimate', config = config)
cbind(b.pool=fit.pool$coef,
      b.odal=fit.odal$btilde,
      sd.pool=summary(fit.pool)$coef[,2],
      sd.odal=sqrt(diag(solve(fit.odal$Htilde)/nrow(lung2))))

## see demo(ODAL) for more optional steps
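The three steps in the example (initialize, derivative, estimate) follow the surrogate-likelihood construction described in the cited references (Jordan et al., 2019; Duan et al., 2020). The following base-R sketch illustrates that construction on simulated data under simplifying assumptions: the initial value is the lead site's own estimate, all helper names are hypothetical, and nothing here calls or reproduces the package's internal code.

```r
set.seed(1)
beta_true <- c(-0.5, 1, -1)

# Simulate one site's data for a logistic model (intercept + 2 covariates)
sim_site <- function(n) {
  X <- cbind(1, rnorm(n), rbinom(n, 1, 0.5))
  y <- rbinom(n, 1, plogis(drop(X %*% beta_true)))
  list(X = X, y = y)
}
sim_sites <- lapply(1:3, function(k) sim_site(2000))

# Average negative log-likelihood and its gradient for one site
nll  <- function(b, d) { eta <- drop(d$X %*% b); mean(log1p(exp(eta)) - d$y * eta) }
grad <- function(b, d) {
  eta <- drop(d$X %*% b)
  drop(crossprod(d$X, plogis(eta) - d$y)) / length(d$y)
}

# Step 1 (initialize): the lead site obtains an initial estimate b_bar
lead  <- sim_sites[[1]]
b_bar <- optim(rep(0, 3), nll, grad, d = lead, method = "BFGS")$par

# Step 2 (derivative): every site shares only its gradient at b_bar and its size
g_sites  <- lapply(sim_sites, function(d) grad(b_bar, d))
n_sites  <- sapply(sim_sites, function(d) length(d$y))
g_global <- Reduce(`+`, Map(`*`, g_sites, n_sites)) / sum(n_sites)

# Step 3 (estimate): the lead site minimizes its local loss plus a linear
# correction that matches the global gradient at b_bar
shift     <- g_global - grad(b_bar, lead)
surr      <- function(b) nll(b, lead) + sum(shift * b)
surr_grad <- function(b) grad(b, lead) + shift
b_tilde   <- optim(b_bar, surr, surr_grad, method = "BFGS")$par

# Pooled fit on the stacked data, for comparison only
pooled <- do.call(rbind, lapply(sim_sites, function(d)
  data.frame(y = d$y, x1 = d$X[, 2], x2 = d$X[, 3])))
b_pool <- unname(coef(glm(y ~ x1 + x2, family = binomial, data = pooled)))
cbind(b_pool = b_pool, b_tilde = b_tilde, b_local = b_bar)
```

Only the gradients and sample sizes leave the collaborating sites, which is why a single round of communication after initialization is enough in this sketch.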
Function to download a JSON file and return it as an R object
pdaGet(name,config)
name |
name of the file |
config |
cloud configuration |
A list of data objects from the json file on the cloud
pda
Function to list available objects
pdaList(config)
config |
a list of variables for cloud configuration |
A list of (json) files on the cloud
pda
Function to upload an object to the cloud as JSON
pdaPut(obj,name,config)
obj |
R object to encode as JSON and upload to the cloud |
name |
name of the file |
config |
a list of variables for cloud configuration |
NONE
pda
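For the flat-file case, the put/list/get round trip described above can be pictured as writing and reading JSON files in a shared directory. The sketch below uses the jsonlite package for the encoding and three hypothetical helpers (`put_json`, `list_json`, `get_json`); these are illustrations only and are not the package's `pdaPut`, `pdaList` or `pdaGet`. The object name `site1_initialize` is likewise an assumed naming pattern.

```r
library(jsonlite)

# Hypothetical helpers for a shared-directory "cloud" (not the pda functions)
put_json <- function(obj, name, dir) {
  path <- file.path(dir, paste0(name, ".json"))
  write_json(obj, path, digits = NA, auto_unbox = TRUE)  # digits = NA: max precision
  invisible(path)
}
list_json <- function(dir) {
  sub("\\.json$", "", list.files(dir, pattern = "\\.json$"))
}
get_json <- function(name, dir) {
  read_json(file.path(dir, paste0(name, ".json")), simplifyVector = TRUE)
}

# Round trip of a small summary-statistics object
shared_json <- file.path(tempdir(), "pda_flatfile_sketch")
dir.create(shared_json, showWarnings = FALSE)
est <- list(site = "site1", bhat = c(0.12, -0.5, 1.25), n = 76)
put_json(est, "site1_initialize", shared_json)
list_json(shared_json)
back <- get_json("site1_initialize", shared_json)
str(back)
```

Only aggregate quantities such as coefficient vectors and sample sizes are written to the shared location in this sketch; the individual-level data never leave the site.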
Update the pda control if ready (run by the lead site)
pdaSync(config)
config |
cloud configuration |
control
pda
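One way to read "update the control if ready" is that the lead site checks whether every site has uploaded its file for the current step and, if so, advances the step recorded in the control. The base-R sketch below illustrates that reading under stated assumptions: the file naming `<site>_<step>.json` (suggested by `site1_estimate` in the `pda` examples), the step names in `step_order`, and the helper `sync_control_sketch` are all hypothetical and do not describe the actual behaviour of `pdaSync`.

```r
# Assumed step order; the real control may use different step names
step_order <- c("initialize", "derive", "estimate")

# Hypothetical readiness check run by the lead site
sync_control_sketch <- function(control, dir) {
  needed <- file.path(dir, paste0(control$sites, "_", control$step, ".json"))
  if (!all(file.exists(needed))) return(control)     # not ready: leave unchanged
  i <- match(control$step, step_order)
  if (!is.na(i) && i < length(step_order)) control$step <- step_order[i + 1]
  control
}

# Demo with empty placeholder files in a fresh shared directory
shared_sync <- file.path(tempdir(), "pda_sync_sketch")
unlink(shared_sync, recursive = TRUE)
dir.create(shared_sync)
control_sketch <- list(sites = c("site1", "site2", "site3"),
                       step = "initialize", lead_site = "site1")

invisible(file.create(file.path(shared_sync,
                                c("site1_initialize.json", "site2_initialize.json"))))
step_before <- sync_control_sketch(control_sketch, shared_sync)$step  # site3 missing

invisible(file.create(file.path(shared_sync, "site3_initialize.json")))
step_after <- sync_control_sketch(control_sketch, shared_sync)$step   # all present
c(before = step_before, after = step_after)
```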