Title: | Linkage Using Diagnosis Codes |
---|---|
Description: | Probabilistic record linkage without direct identifiers using only diagnosis codes. Method is detailed in: Hejblum, Weber, Liao, Palmer, Churchill, Szolovits, Murphy, Kohane & Cai (2019) <doi: 10.1038/sdata.2018.298> ; Zhang, Hejblum, Weber, Palmer, Churchill, Szolovits, Murphy, Liao, Kohane & Cai (2021) <doi: 10.1101/2021.05.02.21256490>. |
Authors: | Boris P Hejblum [aut, cre], Harrison G Zhang [aut], Tianxi Cai [aut] |
Maintainer: | Boris P Hejblum <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.0 |
Built: | 2024-11-08 06:46:14 UTC |
Source: | CRAN |
Linkage Using Diagnosis Codes
This package implements probabilistic record linkage methods that rely on diagnosis codes only, in the absence of direct identifiers.
Package: | ludic |
Type: | Package |
Version: | 0.2.0 |
Date: | 2021-08-18 |
License: | The "MIT License" (MIT) |
The main function of ludic is recordLink.
Boris P. Hejblum, Tianxi Cai — Maintainer: Boris P. Hejblum
Hejblum BP, Weber G, Liao KP, Palmer N, Churchill S, Szolovits P, Murphy S, Kohane I and Cai T, Probabilistic Record Linkage of De-Identified Research Datasets Using Diagnosis Codes, Scientific Data, 6:180298 (2019). doi:10.1038/sdata.2018.298.
Zhang HG, Hejblum BP, Weber G, Palmer N, Churchill S, Szolovits P, Murphy S, Liao KP, Kohane I and Cai T, ATLAS: An automated association test using probabilistically linked health records with application to genetic studies, JAMIA, in press (2021). doi:10.1101/2021.05.02.21256490.
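For orientation, here is a minimal sketch of the typical workflow on toy data (the simulated matrices and argument values are illustrative and mirror the examples further below; see the full RA example under recordLink):

set.seed(42)
K <- 50
# two toy databases sharing their first 5 patients (binary diagnosis code matrices)
common <- matrix(rbinom(5 * K, size = 1, prob = 0.2), nrow = 5)
d1 <- rbind(common, matrix(rbinom(20 * K, size = 1, prob = 0.2), nrow = 20))
d2 <- rbind(common, matrix(rbinom(15 * K, size = 1, prob = 0.2), nrow = 15))
rownames(d1) <- paste0("A", 1:nrow(d1))
rownames(d2) <- paste0("B", 1:nrow(d2))
# posterior matching probabilities for every pair of records
probs <- recordLink(data1 = d1, data2 = d2,
                    eps_plus = 0.01, eps_minus = 0.01, use_diff = FALSE)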
agree_C computes pairwise agreement between the rows of two binary matrices; agree_C_sparse does the same computation using sparse matrices.
agree_C(mat_A, mat_B)

agree_C_sparse(mat_A, mat_B)
mat_A | a binary matrix, with one row per observation and one column per code. |
mat_B | a binary matrix with the same columns as mat_A. |
mat1 <- matrix(round(rnorm(n=1000, sd=1.2)), ncol=10, nrow=100)
mat2 <- rbind(mat1[1:10, ],
              matrix(round(rnorm(n=900, sd=1.2)), ncol=10, nrow=90))
rownames(mat1) <- paste0("A", 1:nrow(mat1))
rownames(mat2) <- paste0("B", 1:nrow(mat2))
mat1 <- 1*(mat1>1)
mat2 <- 1*(mat2>1)
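The example above only builds the binary matrices; a minimal call exercising both implementations on them (the structure of the returned agreement objects is not documented above, so this simply runs both):

a_dense  <- agree_C(mat1, mat2)
a_sparse <- agree_C_sparse(mat1, mat2)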
Computes association test p-values from a generalized linear model for each considered threshold, and computes a p-value for the combination of all the envisioned thresholds through Fisher's method using perturbation resampling.
atlas(
  match_prob,
  y,
  x,
  covar = NULL,
  thresholds = seq(from = 0.1, to = 0.9, by = 0.2),
  nb_perturb = 200,
  dist_family = c("gaussian", "binomial"),
  impute_strategy = c("weighted average", "best")
)
match_prob | matching probabilities matrix (e.g. obtained through recordLink). |
y | response variable of length n. |
x | a matrix or data.frame of covariates to be tested. |
covar | a matrix or data.frame of covariates to adjust on. Default is NULL, in which case there is no adjustment. |
thresholds | a vector (possibly of length 1) of thresholds for the matching probabilities. Default is seq(from = 0.1, to = 0.9, by = 0.2). |
nb_perturb | the number of perturbation resamplings used for the p-value combination. Default is 200. |
dist_family | a character string indicating the distribution family for the glm. Currently, only "gaussian" and "binomial" are supported. Default is "gaussian". |
impute_strategy | a character string indicating which strategy to use to impute x from the matching probabilities match_prob: either "weighted average" (the default) or "best". |
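As a hedged illustration of what a "weighted average" imputation of x from the matching probabilities could look like (the exact computation is internal to atlas; this sketch only assumes rows of match_prob index the records carrying y and columns index the records carrying x, as in the example below):

w <- match_prob / rowSums(match_prob)  # normalize each row into weights
x_imputed <- w %*% x                   # probability-weighted average of the x rows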
a list containing the following:
influencefn_pvals: p-values obtained from influence function perturbations, with the covariates as columns and the thresholds as rows, plus an additional row at the top for the combination.
wald_pvals: a matrix containing the p-values obtained from the Wald test, with the covariates as columns and the thresholds as rows.
ptbed_pvals: a list containing, for each covariate, a matrix with the nb_perturb perturbed p-values, with the different thresholds as rows.
theta_impute: a matrix of the estimated coefficients from the glm when imputing the weighted average for covariates (as columns), with the thresholds as rows.
sd_theta: a matrix of the estimated SD (from the influence function) of the coefficients from the glm when imputing the weighted average for covariates (as columns), with the thresholds as rows.
ptbed_theta_impute: a list containing, for each covariate, a matrix with the nb_perturb perturbed estimated coefficients from the glm when imputing the weighted average for covariates, with the different thresholds as rows.
impute_strategy: a character string indicating which imputation strategy was used (either "weighted average" or "best").
Zhang HG, Hejblum BP, Weber G, Palmer N, Churchill S, Szolovits P, Murphy S, Liao KP, Kohane I and Cai T, ATLAS: An automated association test using probabilistically linked health records with application to genetic studies, JAMIA, in press (2021). doi:10.1101/2021.05.02.21256490.
#rm(list=ls())
n_sims <- 1 #5000
mysim <- function(i){
  x <- matrix(ncol=2, nrow=99, stats::rnorm(n=99*2))
  #plot(density(rbeta(n=1000, 1,2)))
  match_prob <- matrix(rbeta(n=103*99, 1, 2), nrow=103, ncol=99)
  #y <- rnorm(n=103, mean = 1, sd = 0.5)
  #return(atlas(match_prob, y, x, dist_family="gaussian")$influencefn_pvals)
  y <- rbinom(n=103, size = 1, prob=0.5)
  return(atlas(match_prob, y, x, dist_family="binomial")$influencefn_pvals)
}
#res <- pbapply::pblapply(1:n_sims, mysim, cl = parallel::detectCores()-1)
res <- lapply(1:n_sims, mysim)
size <- sapply(1:(ncol(res[[1]])-2), FUN = function(i){
  rowMeans(sapply(res, function(m){m[, i]<0.05}), na.rm = TRUE)
})
rownames(size) <- rownames(res[[1]])
colnames(size) <- colnames(res[[1]])[-(-1:0 + ncol(res[[1]]))]
size
Compute the negative of the sum of the logarithms of a vector of p-values (Fisher's combination rule).
comb_pvals(pv)
pv |
the vector of p-values to be combined together |
According to Fisher's rule, minus twice the sum of the logarithms of K independent p-values follows a chi-squared distribution with 2K degrees of freedom under the null hypothesis. If the p-values are correlated, the combined statistic does not follow this simple chi-square under the null (hence the perturbation resampling used in atlas).
the Fisher combination of the p-values. See Details.
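A minimal illustration of Fisher's rule for independent p-values (this reference distribution is exactly what breaks down under correlation; the values are arbitrary):

pv <- c(0.01, 0.2, 0.5)
stat <- -2 * sum(log(pv))                              # Fisher's combination statistic
pchisq(stat, df = 2 * length(pv), lower.tail = FALSE)  # combined p-value under independence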
em_winkler_big implements the same method when the data are too big to compute the agreement matrix upfront: agreement is then recomputed on the fly each time it is needed, and the EM steps are done entirely in C++. This reduces RAM usage (which remains substantial) at the cost of increased computation time.
em_winkler(
  data1,
  data2,
  tol = 0.001,
  maxit = 500,
  do_plot = TRUE,
  oneone = FALSE,
  verbose = FALSE
)

em_winkler_big(
  data1,
  data2,
  tol = 0.001,
  maxit = 500,
  do_plot = TRUE,
  oneone = FALSE,
  verbose = FALSE
)
data1 | either a binary (0/1) matrix or a binary data.frame whose rows are the observations to be matched. |
data2 | either a binary (0/1) matrix or a binary data.frame whose rows are the observations to be matched with, with the same columns as data1. |
tol | tolerance for the EM algorithm convergence. |
maxit | maximum number of iterations for the EM algorithm. |
do_plot | a logical flag indicating whether a plot should be drawn for the EM convergence. Default is TRUE. |
oneone | a logical flag indicating whether 1-1 matching should be enforced. Default is FALSE. |
verbose | a logical flag indicating whether intermediate values from the EM algorithm should be printed. Useful for debugging. Default is FALSE. |
a list containing:
matchingScore: a matrix of size n1 x n2 with the matching score for each of the n1*n2 pairs.
threshold_ms: the threshold value for the matching scores, above which pairs are considered true matches.
estim_nbmatch: an estimation of the number of true matches (N, the number of pairs considered, multiplied by p, the estimated proportion of true matches from the EM algorithm).
convergence_status: a logical flag indicating whether the EM algorithm converged.
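A hedged usage sketch built only on the value names above (with mat1 and mat2 as constructed in the example below): extract the pairs whose matching score exceeds the returned threshold.

res <- em_winkler(mat1, mat2)
# row/column indices of the pairs flagged as likely true matches
matched_pairs <- which(res$matchingScore > res$threshold_ms, arr.ind = TRUE)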
Winkler WE. Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. Proc Sect Surv Res Methods, Am Stat Assoc 1988: 667-71.
Grannis SJ, Overhage JM, Hui S, et al. Analysis of a probabilistic record linkage technique without human review. AMIA 2003 Symp Proc 2003: 259-63.
mat1 <- matrix(round(rnorm(n=1000, sd=1.2)), ncol=10, nrow=100)
mat2 <- rbind(mat1[1:10, ],
              matrix(round(rnorm(n=900, sd=1.2)), ncol=10, nrow=90))
rownames(mat1) <- paste0("A", 1:nrow(mat1))
rownames(mat2) <- paste0("B", 1:nrow(mat2))
mat1 <- 1*(mat1>1)
mat2 <- 1*(mat2>1)
em_winkler(mat1, mat2)
loglikC_bin implements an even faster C++ implementation of the pseudo-likelihood computation for binary variables. loglikC_bin_wDates implements a C++ implementation of the pseudo-likelihood computation for binary variables with dates.
loglikC_bin(Bmat, Amat, eps_p, eps_n, piA, piB)

loglikC_bin_wDates(Bmat, Amat, Bdates, Adates, eps_p, eps_n, piA, piB)

loglikratioC_diff_arbitrary(Bmat, Amat, d_max, cost)
Bmat | a nB x K binary matrix of the observations to be matched. |
Amat | a nA x K binary matrix of the database into which a match is looked for. |
eps_p | a vector of length K of the discrepancy rates between the two databases for the positives (the 1s). |
eps_n | a vector of length K of the discrepancy rates between the two databases for the negatives (the 0s). |
piA | a vector of length K of the code prevalences in Amat. |
piB | a vector of length K of the code prevalences in Bmat. |
Bdates | a nB x K matrix of the dates associated with the codes in Bmat. |
Adates | a nA x K matrix of the dates associated with the codes in Amat. |
d_max | a numeric vector of length K giving, for each variable, the maximum difference tolerated before counting a discrepancy. |
cost | a numeric vector of length K giving, for each variable, the cost associated with a discrepancy. |
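A minimal sketch of calling loglikC_bin on the binary matrices from the em_winkler example above, estimating prevalences by column means and assuming a flat 1% discrepancy rate (these input choices are illustrative assumptions, not prescriptions):

K <- ncol(mat1)
ll <- loglikC_bin(Bmat = mat2, Amat = mat1,
                  eps_p = rep(0.01, K), eps_n = rep(0.01, K),
                  piA = colMeans(mat1), piB = colMeans(mat2))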
matchingScore_C_sparse_big implements a version using sparse matrices. It manages memory better but is slightly slower, and is thus indicated for big matrices.
matchingScore_C(agreemat, m, u, nA, nB)

matchingScore_C_sparse_big(mat_A, mat_B, m, u)
agreemat | a binary sparse matrix of agreement patterns, with one row per candidate pair and one column per variable. |
m | a vector of length K of the probabilities of agreement among true matches. |
u | a vector of length K of the probabilities of agreement among non-matches. |
nA | an integer indicating the number of observations to be matched. |
nB | an integer indicating the number of observations to be matched with. |
mat_A | a binary sparse matrix of the observations to be matched. |
mat_B | a binary sparse matrix of the observations to be matched with. |
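For context, m and u are the classical Fellegi-Sunter agreement probabilities among matches and non-matches (see the Winkler reference under em_winkler). A sketch of the standard matching weight for a single pair follows; it is the textbook score these inputs suggest, not necessarily the exact quantity computed here:

fs_weight <- function(a, m, u) {
  # a: binary agreement vector of length K for one candidate pair
  sum(a * log(m / u) + (1 - a) * log((1 - m) / (1 - u)))
}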
C++ version: for each observation in 1:n, all the matching probabilities are computed for the p possible pairs.
matchProbs_rank_full_C(computed_dist, prop_match)
computed_dist | a n x p matrix of pre-computed distances between the n observations and their p possible matching pairs. |
prop_match | the a priori proportion of matches ("rho_1"). |
a n x p matrix containing the matching probabilities for each pair
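A minimal call sketch (the distance values are arbitrary placeholders, assumed here only to match the documented signature):

computed_dist <- matrix(runif(5 * 3), nrow = 5, ncol = 3)  # 5 observations, 3 possible pairs each
probs <- matchProbs_rank_full_C(computed_dist, prop_match = 0.5)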
Compute p-values for a Z-score assuming a standard normal distribution of the Z-score under the null hypothesis H0.
pval_zscore(beta, sigma)
beta | the estimate. |
sigma | the estimated variance of the estimate. |
the p-value
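A hedged sketch of what such a computation typically amounts to, assuming sigma is the variance of beta and a two-sided test (both are assumptions, not confirmed by the arguments table):

beta <- 0.8
sigma <- 0.1  # assumed to be a variance, hence the square root below
pval <- 2 * pnorm(-abs(beta) / sqrt(sigma))  # two-sided Gaussian p-value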
An anonymized version of the binarized diagnosis code data from the RA1 and RA2 datasets, over both 6-year and 11-year time spans.
data(RA)
5 objects:
RA1_6y: an integer matrix of 0s and 1s containing 4,936 renamed diagnosis codes for 26,681 patients from the dataset RA1, recorded over a 6-year time span.
RA2_6y: an integer matrix of 0s and 1s containing 4,936 renamed diagnosis codes for 5,707 patients from the dataset RA2, recorded over a 6-year time span.
RA1_11y: an integer matrix of 0s and 1s containing 5,593 renamed diagnosis codes for 26,687 patients from the dataset RA1, recorded over an 11-year time span.
RA2_11y: an integer matrix of 0s and 1s containing 5,593 renamed diagnosis codes for 6,394 patients from the dataset RA2, recorded over an 11-year time span.
silverstandard_truematches: a character matrix with two columns containing the identifiers of the 3,831 pairs of silver-standard matches.
The ICD-9 diagnosis codes have also been masked and randomly reordered, replaced by meaningless names. Finally, the silver-standard matching pairs are also provided to allow the benchmarking of methods for probabilistic record linkage using diagnosis codes.
Hejblum BP, Weber G, Liao KP, Palmer N, Churchill S, Szolovits P, Murphy S, Kohane I and Cai T, Probabilistic Record Linkage of De-Identified Research Datasets Using Diagnosis Codes, Scientific Data, 6:180298 (2019). doi:10.1038/sdata.2018.298.
Liao, K. P. et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care & Research 62, 1120-1127 (2010). doi:10.1002/acr.20184
Liao, K. P. et al. Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease across 3 Chronic Disease Cohorts. PLoS ONE 10, e0136651 (2015). doi:10.1371/journal.pone.0136651
if(interactive()){
  rm(list=ls())
  library(ludic)
  data(RA)
  res_match_6y <- recordLink(data1 = RA1_6y, data2 = RA2_6y,
                             eps_plus = 0.01, eps_minus = 0.01,
                             aggreg_2ways = "mean", min_prev = 0,
                             use_diff = FALSE)
  res_match_11y <- recordLink(data1 = RA1_11y, data2 = RA2_11y,
                              eps_plus = 0.01, eps_minus = 0.01,
                              aggreg_2ways = "mean", min_prev = 0,
                              use_diff = FALSE)
  print.res_matching <- function(res, threshold=0.9, ref=silverstandard_truematches){
    have_match_row <- rowSums(res>threshold)
    have_match_col <- colSums(res>threshold)
    bestmatched_pairs_all <- cbind.data.frame(
      "D1"=rownames(res)[apply(res[,which(have_match_col>0), drop=FALSE], 2, which.max)],
      "D2"=names(have_match_col)[which(have_match_col>0)]
    )
    nTM_all <- nrow(ref)
    nP_all <- nrow(bestmatched_pairs_all)
    TPR_all <- sum(apply(bestmatched_pairs_all, 1, paste0, collapse="") %in%
                     apply(ref, 1, paste0, collapse=""))/nTM_all
    PPV_all <- sum(apply(bestmatched_pairs_all, 1, paste0, collapse="") %in%
                     apply(ref, 1, paste0, collapse=""))/nP_all
    cat("threshold: ", threshold,
        "\nnb matched: ", nP_all, "; nb true matches: ", nTM_all,
        "\nTPR: ", TPR_all, "; PPV: ", PPV_all, "\n\n", sep="")
  }
  print.res_matching(res_match_6y)
  print.res_matching(res_match_11y)
}
Probabilistic Patient Record Linkage
recordLink(
  data1,
  data2,
  dates1 = NULL,
  dates2 = NULL,
  eps_plus,
  eps_minus,
  aggreg_2ways = "mean",
  min_prev = 0.01,
  data1_cont2diff = NULL,
  data2_cont2diff = NULL,
  d_max,
  use_diff = TRUE
)
data1 | either a binary (0/1) matrix or a binary data.frame of dimensions n1 x K. |
data2 | either a binary (0/1) matrix or a binary data.frame of dimensions n2 x K, with the same K columns as data1. |
dates1 | a matrix or data.frame of dimensions n1 x K containing the dates associated with the codes in data1. Default is NULL, in which case dates are not used. |
dates2 | a matrix or data.frame of dimensions n2 x K containing the dates associated with the codes in data2. Default is NULL, in which case dates are not used. |
eps_plus | discrepancy rate between data1 and data2 for the positives (the 1s). |
eps_minus | discrepancy rate between data1 and data2 for the negatives (the 0s). |
aggreg_2ways | a character string indicating how to merge the two posterior probability matrices obtained for each of the 2 databases. Four possibilities are currently implemented; default is "mean". |
min_prev | minimum prevalence for the variables used in matching. Default is 1%. |
data1_cont2diff | either a matrix or data.frame of continuous features, such as age, for which the similarity measure uses the difference with the corresponding features in data2_cont2diff. Default is NULL. |
data2_cont2diff | either a matrix or data.frame of continuous features, such as age, for which the similarity measure uses the difference with the corresponding features in data1_cont2diff. Default is NULL. |
d_max | a numeric vector giving, for each continuous feature, the maximum difference tolerated before counting a discrepancy. |
use_diff | a logical flag indicating whether continuous differentiable variables should be used in the matching. |
Dates: the use of dates1 and dates2 requires that at least one date interval matches across dates1 and dates2 for claiming an agreement on a diagnosis code between data1 and data2, in addition to having that very same code recorded in both.
a matrix of size n1 x n2 with the posterior probability of matching for each of the n1*n2 pairs
Hejblum BP, Weber G, Liao KP, Palmer N, Churchill S, Szolovits P, Murphy S, Kohane I and Cai T, Probabilistic Record Linkage of De-Identified Research Datasets Using Diagnosis Codes, Scientific Data, 6:180298 (2019). doi:10.1038/sdata.2018.298.
set.seed(123)
ncodes <- 500
npat <- 200
incid <- abs(rnorm(n=ncodes, 0.15, 0.07))
bin_codes <- rbinom(n=npat*ncodes, size=1, prob=rep(incid, npat))
bin_codes_mat <- matrix(bin_codes, ncol=ncodes, byrow = TRUE)
data1_ex <- bin_codes_mat[1:(npat/2 + npat/10), ]
data2_ex <- bin_codes_mat[c(1:(npat/10), (npat/2 + npat/10 + 1):npat), ]
rownames(data1_ex) <- paste0("ID", 1:(npat/2 + npat/10), "_data1")
rownames(data2_ex) <- paste0("ID", c(1:(npat/10), (npat/2 + npat/10 + 1):npat), "_data2")
if(interactive()){
  res <- recordLink(data1 = data1_ex, data2 = data2_ex,
                    use_diff = FALSE,
                    eps_minus = 0.01, eps_plus = 0.01)
  round(res[c(1:3, 19:23), c(1:3, 19:23)], 3)
}
Association testing using Han & Lahiri estimating equations and a jackknife approach
test_han2018(
  match_prob,
  y,
  x,
  covar_y = NULL,
  covar_x = NULL,
  jackknife_nrep = 100,
  jackknife_blocksize = max(floor(min(length(y), nrow(x))/jackknife_nrep), 1),
  methods = c("F", "M", "M2"),
  dist_family = c("gaussian", "binomial")
)
match_prob | matching probabilities matrix (e.g. obtained through recordLink). |
y | response variable of length n. |
x | a matrix or data.frame of covariates to be tested. |
covar_y | a matrix or data.frame of covariates to adjust on, measured alongside y. Default is NULL, in which case there is no adjustment. |
covar_x | a matrix or data.frame of covariates to adjust on, measured alongside x. Default is NULL, in which case there is no adjustment. |
jackknife_nrep | the number of jackknife repetitions. Default is 100 (from Han et al.). |
jackknife_blocksize | the number of observations to remove in each jackknife. |
methods | a character vector which must be a subset of c("F", "M", "M2"), indicating which estimators to compute. Default is all three. |
dist_family | a character string indicating the distribution family for the glm. Currently, only "gaussian" and "binomial" are supported. Default is "gaussian". |
a list containing the following for each estimator in methods:
beta: a vector containing the p estimated coefficients.
varcov: the p x p variance-covariance matrix of the beta coefficients.
zscores: a vector containing the p Z-scores.
pval: the corresponding Gaussian assumption p-values.
Han, Y., and Lahiri, P. (2019) Statistical Analysis with Linked Data. International Statistical Review, 87: S139-S157. doi:10.1111/insr.12295.
# rm(list=ls())
# n_sims <- 500
# res <- pbapply::pblapply(1:n_sims, function(n){
#   nx <- 99
#   ny <- 103
#   x <- matrix(ncol=2, nrow=ny, stats::rnorm(n=ny*2))
#
#   #plot(density(rbeta(n=1000, 1,2)))
#   match_prob <- diag(ny)[, 1:nx] #matrix(rbeta(n=ny*nx, 1, 2), nrow=ny, ncol=99)
#
#   covar_y <- matrix(rnorm(n=ny, 1, 0.5), ncol=1)
#   covar_x <- matrix(ncol=3, nrow=ny, stats::rnorm(n=ny*3))
#
#   #y <- rnorm(n=ny, mean = x %*% c(2,-3) + covar_x %*% rep(0.2, ncol(covar_x)) + 0.5*covar_y, 0.5)
#   y <- rbinom(n=ny, 1, prob=expit(x %*% c(2,-3) + covar_x %*%
#                                     rep(0.2, ncol(covar_x)) + 0.5*covar_y))
#   #glm(y~0+x+covar_y+covar_x, family = "binomial")
#   return(
#     #test_han2018(match_prob, y, x, jackknife_blocksize = 10, covar_x = NULL, covar_y = NULL)
#     test_han2018(match_prob, y[1:ny], x[1:nx, ], dist_family = "binomial",
#                  jackknife_blocksize = 10, covar_x = covar_x[1:nx, ],
#                  covar_y = covar_y[1:ny, , drop=FALSE])
#   )
# }, cl=parallel::detectCores()-1)
# pvals_F <- sapply(lapply(res, "[[", "F"), "[[", "beta")
# pvals_M <- sapply(lapply(res, "[[", "M"), "[[", "beta")
# pvals_M2 <- sapply(lapply(res, "[[", "M2"), "[[", "beta")
# quantile(pvals_F)
# quantile(pvals_M)
# quantile(pvals_M2)
# rowMeans(pvals_F<0.05)
# rowMeans(pvals_M<0.05)
# rowMeans(pvals_M2<0.05)
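Since the example above is entirely commented out, here is a minimal runnable sketch of a single call (the simulated data and the use of stats::plogis in place of the unexported expit are illustrative assumptions):

set.seed(1)
nx <- 99
ny <- 103
x <- matrix(stats::rnorm(ny * 2), nrow = ny, ncol = 2)
match_prob <- diag(ny)[, 1:nx]  # perfect 1-1 links for the first nx records
y <- rbinom(n = ny, size = 1, prob = plogis(x %*% c(2, -3)))
res <- test_han2018(match_prob, y, x[1:nx, ],
                    dist_family = "binomial", jackknife_blocksize = 10)
res$F$pval  # Gaussian-assumption p-values for the "F" estimator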