Title: | Inference on Predicted Data |
---|---|
Description: | Performs valid statistical inference on predicted data (IPD) using recent methods, where the outcomes for a subset of the data have been predicted by an algorithm. Provides a wrapper function with specified defaults for the type of model and method to be used for estimation and inference. Further provides methods for tidying and summarizing results. Salerno et al. (2024) <doi:10.48550/arXiv.2410.09665>. |
Authors: | Stephen Salerno [aut, cre, cph] , Jiacheng Miao [aut], Awan Afiaz [aut], Kentaro Hoffman [aut], Anna Neufeld [aut], Qiongshi Lu [aut], Tyler H McCormick [aut], Jeffrey T Leek [aut] |
Maintainer: | Stephen Salerno <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.2 |
Built: | 2024-10-25 05:31:30 UTC |
Source: | CRAN |
A
function for the calculation of the matrix A based on single dataset
A(X, Y, quant = NA, theta, method)
X |
Array or data.frame containing covariates |
Y |
Array or data.frame of outcomes |
quant |
quantile for quantile estimation |
theta |
parameter theta |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
matrix A based on single dataset
Augments the data used for an IPD method/model fit with additional information about each observation.
## S3 method for class 'ipd'
augment(x, data = x$data_u, ...)
x |
An object of class |
data |
The |
... |
Additional arguments to be passed to the augment function. |
A data.frame containing the original data along with fitted values and residuals.
#-- Generate Example Data
set.seed(2023)
dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)
head(dat)
formula <- Y - f ~ X1

#-- Fit IPD
fit <- ipd(formula, method = "postpi_analytic", model = "ols", data = dat, label = "set")

#-- Augment Data
augmented_df <- augment(fit)
head(augmented_df)
Calculates the optimal value of lhat for the prediction-powered confidence interval for GLMs.
calc_lhat_glm( grads, grads_hat, grads_hat_unlabeled, inv_hessian, coord = NULL, clip = FALSE )
grads |
(matrix): n x p matrix gradient of the loss function with respect to the parameter evaluated at the labeled data. |
grads_hat |
(matrix): n x p matrix gradient of the loss function with respect to the model parameter evaluated using predictions on the labeled data. |
grads_hat_unlabeled |
(matrix): N x p matrix gradient of the loss function with respect to the parameter evaluated using predictions on the unlabeled data. |
inv_hessian |
(matrix): p x p matrix inverse of the Hessian of the loss function with respect to the parameter. |
coord |
(int, optional): Coordinate for which to optimize |
clip |
(boolean, optional): Whether to clip the value of lhat to be non-negative. Defaults to FALSE. |
(float): Optimal value of lhat in [0,1].
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
est <- ppi_plusplus_ols_est(X_l, Y_l, f_l, X_u, f_u)
stats <- ols_get_stats(est, X_l, Y_l, f_l, X_u, f_u)
calc_lhat_glm(stats$grads, stats$grads_hat, stats$grads_hat_unlabeled, stats$inv_hessian, coord = NULL, clip = FALSE)
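To give a feel for what the tuning parameter lhat does, the base R sketch below works through the much simpler scalar mean-estimation case rather than the GLM case handled by calc_lhat_glm. It assumes the commonly cited form of the power-tuning weight for the mean, Cov(Y, f) / ((1 + n/N) Var(f)), with the variance of f pooled over labeled and unlabeled predictions and the result clipped to [0, 1]. The simulated data and all variable names are made up for illustration, and this is not the package's implementation.

```r
set.seed(3)
n <- 100    # labeled sample size
N <- 1000   # unlabeled sample size

# Simulated outcomes and noisy predictions (illustrative only)
Y_l <- rnorm(n)
f_l <- Y_l + rnorm(n, sd = 0.5)
f_u <- rnorm(N) + rnorm(N, sd = 0.5)

# Assumed mean-case formula: covariance on the labeled data,
# variance of the predictions pooled over both data sets
lhat_raw <- cov(Y_l, f_l) / ((1 + n / N) * var(c(f_l, f_u)))

# Clip to the unit interval
lhat <- min(max(lhat_raw, 0), 1)
lhat
```

Values near 1 indicate that the predictions carry most of the information about the outcome, while values near 0 mean the estimate leans almost entirely on the labeled data.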
Computes the empirical CDF of the data.
compute_cdf(Y, grid, w = NULL)
Y |
(matrix): n x 1 matrix of observed data. |
grid |
(matrix): Grid of values to compute the CDF at. |
w |
(vector, optional): n-vector of sample weights. |
(list): Empirical CDF and its standard deviation at the specified grid points.
Y <- c(1, 2, 3, 4, 5)
grid <- seq(0, 6, by = 0.5)
compute_cdf(Y, grid)
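As a rough illustration of what evaluating an empirical CDF on a grid involves, the base R sketch below builds the indicator 1{Y <= g} for every observation and grid point, then reports its column means and standard deviations. The helper name ecdf_on_grid is hypothetical and the sketch is unweighted; how compute_cdf handles the weights w or scales the standard deviation is not stated in this manual, so the output may differ from the package's.

```r
# Hypothetical helper: unweighted empirical CDF evaluated on a grid.
ecdf_on_grid <- function(Y, grid) {
  Y <- as.numeric(Y)
  # n x length(grid) matrix: entry (i, j) is 1 if Y[i] <= grid[j], else 0
  ind <- outer(Y, grid, FUN = "<=") * 1
  list(
    cdf = colMeans(ind),      # share of observations at or below each grid point
    sd  = apply(ind, 2, sd)   # sample standard deviation of the indicators
  )
}

Y <- c(1, 2, 3, 4, 5)
grid <- seq(0, 6, by = 0.5)
res <- ecdf_on_grid(Y, grid)
res$cdf
```

The cdf component agrees with what stats::ecdf(Y)(grid) returns for the same inputs.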
Computes the difference between the empirical CDFs of the data and the predictions.
compute_cdf_diff(Y, f, grid, w = NULL)
Y |
(matrix): n x 1 matrix of observed data. |
f |
(matrix): n x 1 matrix of predictions. |
grid |
(matrix): Grid of values to compute the CDF at. |
w |
(vector, optional): n-vector of sample weights. |
(list): Difference between the empirical CDFs of the data and the predictions and its standard deviation at the specified grid points.
Y <- c(1, 2, 3, 4, 5)
f <- c(1.1, 2.2, 3.3, 4.4, 5.5)
grid <- seq(0, 6, by = 0.5)
compute_cdf_diff(Y, f, grid)
est_ini
function for initial estimation
est_ini(X, Y, quant = NA, method)
X |
Array or data.frame containing covariates |
Y |
Array or data.frame of outcomes |
quant |
quantile for quantile estimation |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
initial estimator
Glances at the IPD method/model fit, returning a one-row summary.
## S3 method for class 'ipd'
glance(x, ...)
x |
An object of class |
... |
Additional arguments to be passed to the glance function. |
A one-row data frame summarizing the IPD method/model fit.
#-- Generate Example Data
set.seed(2023)
dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)
head(dat)
formula <- Y - f ~ X1

#-- Fit IPD
fit <- ipd(formula, method = "postpi_analytic", model = "ols", data = dat, label = "set")

#-- Glance Output
glance(fit)
The main wrapper function for conducting inference on predicted data (IPD) using various methods and models. Returns a list of fitted model components.
ipd( formula, method, model, data, label = NULL, unlabeled_data = NULL, seed = NULL, intercept = TRUE, alpha = 0.05, alternative = "two-sided", ... )
formula |
An object of class |
method |
The method to be used for fitting the model. Must be one of
|
model |
The type of model to be fitted. Must be one of |
data |
A |
label |
A |
unlabeled_data |
(optional) A |
seed |
(optional) An |
intercept |
|
alpha |
The significance level for confidence intervals. Default is 0.05. |
alternative |
A string specifying the alternative hypothesis. Must be
one of |
... |
Additional arguments to be passed to the fitting function. See
the |
1. Formula:
The ipd function uses one formula argument that specifies both the calibrating model (e.g., PostPI "relationship model", PPI "rectifier" model) and the inferential model. These separate models will be created internally based on the specific method called.
2. Data:
The data can be specified in two ways:
(1) Single data argument (data) containing a stacked data.frame and a label identifier (label).
(2) Two data arguments, one for the labeled data (data) and one for the unlabeled data (unlabeled_data).
For option (1), provide one data argument (data) which contains a stacked data.frame with both the unlabeled and labeled data, and a label argument that specifies the column that identifies the labeled versus the unlabeled observations in the stacked data.frame.
NOTE: Labeled data identifiers can be:
"l", "lab", "label", "labeled", "labelled", "tst", "test", "true"
TRUE
Non-reference category (i.e., binary 1)
Unlabeled data identifiers can be:
"u", "unlab", "unlabeled", "unlabelled", "val", "validation", "false"
FALSE
Reference category (i.e., binary 0)
For option (2), provide separate data arguments for the labeled data set (data) and the unlabeled data set (unlabeled_data). If the second argument is provided, the function ignores the label identifier and assumes the data provided are not stacked.
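The two data layouts described above can be illustrated without calling the package. The base R sketch below builds a small stacked data.frame with a label column and then splits it into the two pieces that option (2) expects. The column names (Y, f, X1, set) mirror the examples in this manual, but the values are made up for illustration.

```r
# Toy stacked data set: outcomes are missing for the unlabeled rows.
stacked <- data.frame(
  Y   = c(1.2, 0.7, NA, NA),
  f   = c(1.0, 0.9, 1.1, 0.4),
  X1  = c(0.5, -0.3, 0.8, -1.1),
  set = c("labeled", "labeled", "unlabeled", "unlabeled")
)

# Option (1) would pass `stacked` together with label = "set".
# Option (2) would pass the two pieces separately, obtained like this:
labeled_df   <- stacked[stacked$set == "labeled", ]
unlabeled_df <- stacked[stacked$set == "unlabeled", ]

nrow(labeled_df)
nrow(unlabeled_df)
```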
3. Method:
Use the method argument to specify the fitting method:
Wang et al. (2020) Post-Prediction Inference (PostPI)
Angelopoulos et al. (2023) Prediction-Powered Inference (PPI)
Angelopoulos et al. (2023) PPI++
Miao et al. (2023) Assumption-Lean and Data-Adaptive Post-Prediction Inference (PSPA)
4. Model:
Use the model argument to specify the type of model:
Mean value of the outcome
qth quantile of the outcome
Linear regression
Logistic regression
Poisson regression
The ipd wrapper function will concatenate the method and model arguments to identify the required helper function, following the naming convention "method_model".
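The "method_model" convention amounts to simple string concatenation. The sketch below only builds the name; how the wrapper then looks up and calls the helper internally is not shown in this manual, so the lookup step is left out.

```r
# Build the helper name from the two arguments, following "method_model".
method <- "ppi_plusplus"
model  <- "ols"

helper_name <- paste(method, model, sep = "_")
helper_name
```

With these inputs the resulting string is "ppi_plusplus_ols".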
5. Auxiliary Arguments:
The wrapper function will take method-specific auxiliary arguments (e.g., q for the quantile estimation models) and pass them to the helper function through the "..." with specified defaults for simplicity.
6. Other Arguments:
All other arguments that relate to all methods (e.g., alpha, ci.type), or other method-specific arguments, will have defaults.
a summary of model output.
A list containing the fitted model components:
Estimated coefficients of the model
Standard errors of the estimated coefficients
Confidence intervals for the estimated coefficients
The formula used to fit the ipd model.
The data frame used for model fitting.
The method used for model fitting.
The type of model fitted.
Logical. Indicates if an intercept was included in the model.
Fitted model object containing estimated coefficients, standard errors, confidence intervals, and additional method-specific output.
Additional output specific to the method used.
#-- Generate Example Data
set.seed(12345)
dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)
head(dat)
formula <- Y - f ~ X1

#-- PostPI Analytic Correction (Wang et al., 2020)
ipd(formula, method = "postpi_analytic", model = "ols", data = dat, label = "set")

#-- PostPI Bootstrap Correction (Wang et al., 2020)
nboot <- 200
ipd(formula, method = "postpi_boot", model = "ols", data = dat, label = "set", nboot = nboot)

#-- PPI (Angelopoulos et al., 2023)
ipd(formula, method = "ppi", model = "ols", data = dat, label = "set")

#-- PPI++ (Angelopoulos et al., 2023)
ipd(formula, method = "ppi_plusplus", model = "ols", data = dat, label = "set")

#-- PSPA (Miao et al., 2023)
ipd(formula, method = "pspa", model = "ols", data = dat, label = "set")
link_grad
function for gradient of the link function
link_grad(t, method)
t |
t |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
gradient of the link function
link_Hessian
function for Hessians of the link function
link_Hessian(t, method)
t |
t |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
Hessians of the link function
Computes the natural logarithm of 1 plus the exponential of the input, log(1 + exp(x)), in a way that remains stable for large inputs.
log1pexp(x)
x |
(vector): A numeric vector of inputs. |
(vector): A numeric vector where each element is the result of log(1 + exp(x)).
x <- c(-1, 0, 1, 10, 100)
log1pexp(x)
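The reason a dedicated function is useful is that the naive expression log(1 + exp(x)) overflows to Inf once exp(x) exceeds the largest representable double. One standard way around this, sketched below in base R, uses the identity log(1 + exp(x)) = x + log(1 + exp(-x)) for positive x. The helper name stable_log1pexp is hypothetical, and the package's log1pexp may use a different formulation.

```r
# Hypothetical helper: numerically stable log(1 + exp(x)).
stable_log1pexp <- function(x) {
  # For x > 0, rewrite as x + log1p(exp(-x)) so that exp() never overflows
  # in the branch that is actually selected.
  ifelse(x > 0, x + log1p(exp(-x)), log1p(exp(x)))
}

x <- c(-1, 0, 1, 10, 100, 1000)
naive  <- log(1 + exp(x))      # the last element overflows to Inf
stable <- stable_log1pexp(x)   # stays finite for every element
cbind(x, naive, stable)
```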
Computes the statistics needed for the logistic regression-based prediction-powered inference.
logistic_get_stats( est, X_l, Y_l, f_l, X_u, f_u, w_l = NULL, w_u = NULL, use_u = TRUE )
est |
(vector): Point estimates of the coefficients. |
X_l |
(matrix): Covariates for the labeled data set. |
Y_l |
(vector): Labels for the labeled data set. |
f_l |
(vector): Predictions for the labeled data set. |
X_u |
(matrix): Covariates for the unlabeled data set. |
f_u |
(vector): Predictions for the unlabeled data set. |
w_l |
(vector, optional): Sample weights for the labeled data set. |
w_u |
(vector, optional): Sample weights for the unlabeled data set. |
use_u |
(bool, optional): Whether to use the unlabeled data set. |
(list): A list containing the following:
(matrix): n x p matrix gradient of the loss function with respect to the coefficients.
(matrix): n x p matrix gradient of the loss function with respect to the coefficients, evaluated using the labeled predictions.
(matrix): N x p matrix gradient of the loss function with respect to the coefficients, evaluated using the unlabeled predictions.
(matrix): p x p matrix inverse Hessian of the loss function with respect to the coefficients.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
est <- ppi_plusplus_logistic_est(X_l, Y_l, f_l, X_u, f_u)
stats <- logistic_get_stats(est, X_l, Y_l, f_l, X_u, f_u)
mean_psi
function for sample expectation of psi
mean_psi(X, Y, theta, quant = NA, method)
X |
Array or data.frame containing covariates |
Y |
Array or data.frame of outcomes |
theta |
parameter theta |
quant |
quantile for quantile estimation |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
sample expectation of psi
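For readers unfamiliar with M-estimation, the "sample expectation of psi" is the average of an estimating function over the data, and the estimator is the value of theta that sets this average to zero. The base R sketch below shows this for the linear regression case, assuming the textbook estimating function psi_i(theta) = x_i (y_i - x_i' theta). The helper name mean_psi_ols is hypothetical, and the package's mean_psi may use a different sign or scaling.

```r
# Hypothetical helper: average OLS estimating function at a given theta.
mean_psi_ols <- function(X, Y, theta) {
  X <- as.matrix(X)
  resid <- drop(as.numeric(Y) - X %*% theta)  # residuals y_i - x_i' theta
  colMeans(X * resid)                         # one average per coefficient
}

set.seed(11)
X <- cbind(1, rnorm(50))
Y <- drop(X %*% c(0.5, -1)) + rnorm(50)

# At the least squares solution the averaged estimating function vanishes
theta_hat <- drop(solve(crossprod(X), crossprod(X, Y)))
mean_psi_ols(X, Y, theta_hat)

# Away from the solution it does not
mean_psi_ols(X, Y, c(0, 0))
```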
mean_psi_pop
function for sample expectation of PSPA psi
mean_psi_pop( X_lab, X_unlab, Y_lab, Yhat_lab, Yhat_unlab, w, theta, quant = NA, method )
X_lab |
Array or data.frame containing observed covariates in labeled data. |
X_unlab |
Array or data.frame containing observed or predicted covariates in unlabeled data. |
Y_lab |
Array or data.frame of observed outcomes in labeled data. |
Yhat_lab |
Array or data.frame of predicted outcomes in labeled data. |
Yhat_unlab |
Array or data.frame of predicted outcomes in unlabeled data. |
w |
weights vector PSPA linear regression (d-dimensional, where d equals the number of covariates). |
theta |
parameter theta |
quant |
quantile for quantile estimation |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
sample expectation of PSPA psi
Computes the ordinary least squares coefficients.
ols(X, Y, return_se = FALSE)
X |
(matrix): n x p matrix of covariates. |
Y |
(vector): n-vector of outcome values. |
return_se |
(bool, optional): Whether to return the standard errors of the coefficients. |
(list): A list containing the following:
(vector): p-vector of ordinary least squares estimates of the coefficients.
(vector): If return_se == TRUE, return the p-vector of standard errors of the coefficients.
n <- 1000
X <- rnorm(n, 1, 1)
Y <- X + rnorm(n, 0, 1)
ols(X, Y, return_se = TRUE)
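For reference, ordinary least squares coefficients and their classical standard errors can be computed directly from the normal equations. The base R sketch below does this on simulated data. It is not the package's ols implementation, and the manual does not say whether ols adds an intercept or which kind of standard error it reports, so the intercept column and the classical variance formula here are assumptions.

```r
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))             # design matrix with an explicit intercept column
Y <- 2 + 3 * X[, 2] + rnorm(n)

# Coefficients: solve the normal equations (X'X) b = X'Y
beta_hat <- solve(crossprod(X), crossprod(X, Y))

# Classical standard errors: sqrt of the diagonal of sigma^2 (X'X)^{-1}
resid  <- Y - X %*% beta_hat
sigma2 <- sum(resid^2) / (n - ncol(X))
se_hat <- sqrt(diag(sigma2 * solve(crossprod(X))))

cbind(estimate = drop(beta_hat), se = se_hat)
```

These values match what lm(Y ~ X[, 2]) reports for the same data.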
Computes the statistics needed for the OLS-based prediction-powered inference.
ols_get_stats( est, X_l, Y_l, f_l, X_u, f_u, w_l = NULL, w_u = NULL, use_u = TRUE )
est |
(vector): Point estimates of the coefficients. |
X_l |
(matrix): Covariates for the labeled data set. |
Y_l |
(vector): Labels for the labeled data set. |
f_l |
(vector): Predictions for the labeled data set. |
X_u |
(matrix): Covariates for the unlabeled data set. |
f_u |
(vector): Predictions for the unlabeled data set. |
w_l |
(vector, optional): Sample weights for the labeled data set. |
w_u |
(vector, optional): Sample weights for the unlabeled data set. |
use_u |
(boolean, optional): Whether to use the unlabeled data set. |
(list): A list containing the following:
(matrix): n x p matrix gradient of the loss function with respect to the coefficients.
(matrix): n x p matrix gradient of the loss function with respect to the coefficients, evaluated using the labeled predictions.
(matrix): N x p matrix gradient of the loss function with respect to the coefficients, evaluated using the unlabeled predictions.
(matrix): p x p matrix inverse Hessian of the loss function with respect to the coefficients.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
est <- ppi_plusplus_ols_est(X_l, Y_l, f_l, X_u, f_u)
stats <- ols_get_stats(est, X_l, Y_l, f_l, X_u, f_u, use_u = TRUE)
optim_est
function for One-step update for obtaining estimator
optim_est( X_lab, X_unlab, Y_lab, Yhat_lab, Yhat_unlab, w, theta, quant = NA, method )
X_lab |
Array or data.frame containing observed covariates in labeled data. |
X_unlab |
Array or data.frame containing observed or predicted covariates in unlabeled data. |
Y_lab |
Array or data.frame of observed outcomes in labeled data. |
Yhat_lab |
Array or data.frame of predicted outcomes in labeled data. |
Yhat_unlab |
Array or data.frame of predicted outcomes in unlabeled data. |
w |
weights vector PSPA linear regression (d-dimensional, where d equals the number of covariates). |
theta |
parameter theta |
quant |
quantile for quantile estimation |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
estimator
optim_weights
function for one-step update for obtaining the weights
optim_weights( X_lab, X_unlab, Y_lab, Yhat_lab, Yhat_unlab, w, theta, quant = NA, method )
X_lab |
Array or data.frame containing observed covariates in labeled data. |
X_unlab |
Array or data.frame containing observed or predicted covariates in unlabeled data. |
Y_lab |
Array or data.frame of observed outcomes in labeled data. |
Yhat_lab |
Array or data.frame of predicted outcomes in labeled data. |
Yhat_unlab |
Array or data.frame of predicted outcomes in unlabeled data. |
w |
weights vector PSPA linear regression (d-dimensional, where d equals the number of covariates). |
theta |
parameter theta |
quant |
quantile for quantile estimation |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
weights
Helper function for PostPI OLS estimation (analytic correction)
postpi_analytic_ols(X_l, Y_l, f_l, X_u, f_u, scale_se = TRUE, n_t = Inf)
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
scale_se |
(boolean): Logical argument to scale relationship model error variance. Defaults to TRUE; FALSE option is retained for posterity. |
n_t |
(integer, optional) Size of the dataset used to train the
prediction function (necessary if |
Methods for correcting inference based on outcomes predicted by machine learning (Wang et al., 2020) https://www.pnas.org/doi/abs/10.1073/pnas.2001238117
A list of outputs: estimate of the inference model parameters and corresponding standard error estimate.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
postpi_analytic_ols(X_l, Y_l, f_l, X_u, f_u)
Helper function for PostPI logistic regression (bootstrap correction)
postpi_boot_logistic( X_l, Y_l, f_l, X_u, f_u, nboot = 100, se_type = "par", seed = NULL )
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
nboot |
(integer): Number of bootstrap samples. Defaults to 100. |
se_type |
(string): Which method to calculate the standard errors. Options include "par" (parametric) or "npar" (nonparametric). Defaults to "par". |
seed |
(optional) An |
Methods for correcting inference based on outcomes predicted by machine learning (Wang et al., 2020) https://www.pnas.org/doi/abs/10.1073/pnas.2001238117
A list of outputs: estimate of inference model parameters and corresponding standard error based on both parametric and non-parametric bootstrap methods.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
postpi_boot_logistic(X_l, Y_l, f_l, X_u, f_u, nboot = 200)
Helper function for PostPI OLS estimation (bootstrap correction)
postpi_boot_ols( X_l, Y_l, f_l, X_u, f_u, nboot = 100, se_type = "par", rel_func = "lm", scale_se = TRUE, n_t = Inf, seed = NULL )
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
nboot |
(integer): Number of bootstrap samples. Defaults to 100. |
se_type |
(string): Which method to calculate the standard errors. Options include "par" (parametric) or "npar" (nonparametric). Defaults to "par". |
rel_func |
(string): Method for fitting the relationship model. Options include "lm" (linear model), "rf" (random forest), and "gam" (generalized additive model). Defaults to "lm". |
scale_se |
(boolean): Logical argument to scale relationship model error variance. Defaults to TRUE; FALSE option is retained for posterity. |
n_t |
(integer, optional) Size of the dataset used to train the
prediction function (necessary if |
seed |
(optional) An |
Methods for correcting inference based on outcomes predicted by machine learning (Wang et al., 2020) https://www.pnas.org/doi/abs/10.1073/pnas.2001238117
A list of outputs: estimate of inference model parameters and corresponding standard error based on both parametric and non-parametric bootstrap methods.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
postpi_boot_ols(X_l, Y_l, f_l, X_u, f_u, nboot = 200)
Helper function for PPI logistic regression
ppi_logistic(X_l, Y_l, f_l, X_u, f_u, opts = NULL)
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
opts |
(list, optional): Options to pass to the optimizer. See ?optim for details. |
Prediction Powered Inference (Angelopoulos et al., 2023) https://www.science.org/doi/10.1126/science.adi6000
(list): A list containing the following:
(vector): vector of PPI logistic regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
(vector): vector of the rectifier logistic regression coefficient estimates.
(matrix): covariance matrix for the gradients in the unlabeled data.
(matrix): covariance matrix for the gradients in the labeled data.
(matrix): matrix of gradients for the labeled data.
(matrix): matrix of predicted gradients for the unlabeled data.
(matrix): matrix of predicted gradients for the labeled data.
(matrix): inverse Hessian matrix.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_logistic(X_l, Y_l, f_l, X_u, f_u)
Helper function for PPI mean estimation
ppi_mean(Y_l, f_l, f_u, alpha = 0.05, alternative = "two-sided")
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
alternative |
(string): Alternative hypothesis. Must be one of
|
Prediction Powered Inference (Angelopoulos et al., 2023) https://www.science.org/doi/10.1126/science.adi6000
tuple: Lower and upper bounds of the prediction-powered confidence interval for the mean.
dat <- simdat(model = "mean")
form <- Y - f ~ 1
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_mean(Y_l, f_l, f_u)
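The idea behind the prediction-powered interval for a mean can be written out in a few lines of base R: take the mean of the predictions on the unlabeled data, correct it by the average prediction error observed on the labeled data, and combine the two sampling variances. The sketch below follows that recipe on simulated data for the two-sided case. It is an illustration of the technique under these assumptions, not the package's ppi_mean code, and the variable names and simulated values are made up.

```r
set.seed(42)
n <- 100     # labeled sample size
N <- 1000    # unlabeled sample size
alpha <- 0.05

# Simulated outcomes and noisy predictions (illustrative only)
Y_l <- rnorm(n, mean = 1)
f_l <- Y_l + rnorm(n, sd = 0.5)
f_u <- rnorm(N, mean = 1) + rnorm(N, sd = 0.5)

# Point estimate: mean prediction on the unlabeled data plus the
# average prediction error ("rectifier") measured on the labeled data
theta_hat <- mean(f_u) + mean(Y_l - f_l)

# Standard error: the two terms come from independent samples
se_hat <- sqrt(var(f_u) / N + var(Y_l - f_l) / n)

# Two-sided normal-approximation confidence interval
ci <- theta_hat + c(-1, 1) * qnorm(1 - alpha / 2) * se_hat
ci
```

When the predictions are accurate, var(Y_l - f_l) is small, so the interval can be much narrower than one based on the n labeled outcomes alone.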
Helper function for prediction-powered inference for OLS estimation
ppi_ols(X_l, Y_l, f_l, X_u, f_u, w_l = NULL, w_u = NULL)
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
w_l |
(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
w_u |
(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
Prediction Powered Inference (Angelopoulos et al., 2023) https://www.science.org/doi/10.1126/science.adi6000
(list): A list containing the following:
(vector): vector of PPI OLS regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
(vector): vector of the rectifier OLS regression coefficient estimates.
dat <- simdat()
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_ols(X_l, Y_l, f_l, X_u, f_u)
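For intuition, the PPI OLS point estimate can be written as two ordinary least-squares fits. This is a rough sketch under stated assumptions: the data are simulated, the helper names ols_coef and ppi_ols_sketch are hypothetical, and the correction term is defined as the regression of Y_l - f_l on X_l, which may differ in sign from the rectifier the package returns.

```r
# Hypothetical helpers, not the package source.
ols_coef <- function(X, y) solve(crossprod(X), crossprod(X, y))

ppi_ols_sketch <- function(X_l, Y_l, f_l, X_u, f_u) {
  imputed <- ols_coef(X_u, f_u)            # OLS of predictions on covariates, unlabeled data
  correction <- ols_coef(X_l, Y_l - f_l)   # OLS of prediction errors on covariates, labeled data
  drop(imputed + correction)
}

# Simulated data (assumption): true coefficients are (1, 2), predictions biased by 0.5
set.seed(1)
n <- 300
N <- 3000
X_l <- cbind(1, rnorm(n))
X_u <- cbind(1, rnorm(N))
Y_l <- drop(X_l %*% c(1, 2)) + rnorm(n)
Y_u <- drop(X_u %*% c(1, 2)) + rnorm(N)   # unobserved in practice
f_l <- Y_l + 0.5 + rnorm(n, sd = 0.5)
f_u <- Y_u + 0.5 + rnorm(N, sd = 0.5)
est <- ppi_ols_sketch(X_l, Y_l, f_l, X_u, f_u)
print(est)
```

Regressing the predictions alone would shift the intercept by the prediction bias; the labeled-data correction removes that shift.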
Helper function for PPI++ logistic regression
ppi_plusplus_logistic( X_l, Y_l, f_l, X_u, f_u, lhat = NULL, coord = NULL, opts = NULL, w_l = NULL, w_u = NULL )
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
lhat |
(float, optional): Power-tuning parameter (see
https://arxiv.org/abs/2311.01453). The default value, |
coord |
(int, optional): Coordinate for which to optimize
|
opts |
(list, optional): Options to pass to the optimizer. See ?optim for details. |
w_l |
(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
w_u |
(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453
(list): A list containing the following:
(vector): vector of PPI++ logistic regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
(float): estimated power-tuning parameter.
(vector): vector of the rectifier logistic regression coefficient estimates.
(matrix): covariance matrix for the gradients in the unlabeled data.
(matrix): covariance matrix for the gradients in the labeled data.
(matrix): matrix of gradients for the labeled data.
(matrix): matrix of predicted gradients for the unlabeled data.
(matrix): matrix of predicted gradients for the labeled data.
(matrix): inverse Hessian matrix.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_logistic(X_l, Y_l, f_l, X_u, f_u)
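The role of the power-tuning parameter is easiest to see in the objective that PPI++ minimizes. The sketch below, assuming simulated data, a fixed lhat, and hypothetical helper names (logit_loss, ppi_pp_logistic_est_sketch), minimizes the labeled logistic loss plus lhat times the difference between the loss on unlabeled predictions and the loss on labeled predictions. It uses optim() directly and is not the package's code.

```r
# Average logistic loss; y may be a 0/1 label or a prediction in [0, 1]
logit_loss <- function(theta, X, y) {
  eta <- drop(X %*% theta)
  mean(log1p(exp(eta)) - y * eta)
}

# Hypothetical helper: PPI++-style point estimate for a fixed lhat in [0, 1]
ppi_pp_logistic_est_sketch <- function(X_l, Y_l, f_l, X_u, f_u, lhat = 1) {
  objective <- function(theta) {
    logit_loss(theta, X_l, Y_l) +
      lhat * (logit_loss(theta, X_u, f_u) - logit_loss(theta, X_l, f_l))
  }
  optim(rep(0, ncol(X_l)), objective, method = "BFGS")$par
}

# Simulated data (assumption): predictions are true labels flipped with probability 0.1
set.seed(1)
n <- 300
N <- 3000
beta <- c(-0.5, 1)
X_l <- cbind(1, rnorm(n))
X_u <- cbind(1, rnorm(N))
Y_l <- rbinom(n, 1, plogis(drop(X_l %*% beta)))
Y_u <- rbinom(N, 1, plogis(drop(X_u %*% beta)))   # unobserved in practice
f_l <- ifelse(runif(n) < 0.1, 1 - Y_l, Y_l)
f_u <- ifelse(runif(N) < 0.1, 1 - Y_u, Y_u)

est_tuned <- ppi_pp_logistic_est_sketch(X_l, Y_l, f_l, X_u, f_u, lhat = 0.5)
est_classical <- ppi_pp_logistic_est_sketch(X_l, Y_l, f_l, X_u, f_u, lhat = 0)
print(rbind(est_tuned, est_classical))
```

With lhat = 0 the prediction terms drop out and the objective reduces to ordinary logistic regression on the labeled data.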
Helper function for PPI++ logistic regression (point estimate)
ppi_plusplus_logistic_est( X_l, Y_l, f_l, X_u, f_u, lhat = NULL, coord = NULL, opts = NULL, w_l = NULL, w_u = NULL )
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
lhat |
(float, optional): Power-tuning parameter (see
https://arxiv.org/abs/2311.01453). The default value, |
coord |
(int, optional): Coordinate for which to optimize
|
opts |
(list, optional): Options to pass to the optimizer. See ?optim for details. |
w_l |
(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
w_u |
(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453
(vector): vector of prediction-powered point estimates of the logistic regression coefficients.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_logistic_est(X_l, Y_l, f_l, X_u, f_u)
Helper function for PPI++ mean estimation
ppi_plusplus_mean( Y_l, f_l, f_u, alpha = 0.05, alternative = "two-sided", lhat = NULL, coord = NULL, w_l = NULL, w_u = NULL )
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
alternative |
(string): Alternative hypothesis. Must be one of
|
lhat |
(float, optional): Power-tuning parameter (see
https://arxiv.org/abs/2311.01453). The default value, |
coord |
(int, optional): Coordinate for which to optimize
|
w_l |
(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
w_u |
(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453
tuple: Lower and upper bounds of the prediction-powered confidence interval for the mean.
dat <- simdat(model = "mean")
form <- Y - f ~ 1
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_mean(Y_l, f_l, f_u)
Helper function for PPI++ mean estimation (point estimate)
ppi_plusplus_mean_est( Y_l, f_l, f_u, lhat = NULL, coord = NULL, w_l = NULL, w_u = NULL )
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
lhat |
(float, optional): Power-tuning parameter (see
https://arxiv.org/abs/2311.01453). The default value, |
coord |
(int, optional): Coordinate for which to optimize
|
w_l |
(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
w_u |
(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453
float or ndarray: Prediction-powered point estimate of the mean.
dat <- simdat(model = "mean")
form <- Y - f ~ 1
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_mean_est(Y_l, f_l, f_u)
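As an illustration of power tuning for the mean, the sketch below computes the PPI++ point estimate for a given lhat and, when lhat is NULL, plugs in the variance-minimizing value described in the PPI++ paper. The helper name ppi_pp_mean_sketch, the simulated data, and the exact plug-in formula are assumptions made for this example; weights are ignored and the package's internals may differ.

```r
# Hypothetical helper, not the package source: PPI++ mean with optional power tuning.
ppi_pp_mean_sketch <- function(Y_l, f_l, f_u, lhat = NULL) {
  n <- length(Y_l)
  N <- length(f_u)
  if (is.null(lhat)) {
    # Plug-in estimate of the variance-minimizing tuning parameter (assumed form)
    lhat <- cov(Y_l, f_l) / ((1 + n / N) * var(c(f_l, f_u)))
  }
  est <- mean(Y_l) + lhat * (mean(f_u) - mean(f_l))
  se <- sqrt(var(Y_l - lhat * f_l) / n + lhat^2 * var(f_u) / N)
  list(est = est, se = se, lhat = lhat)
}

# Simulated data (assumption): noisy, biased predictions
set.seed(1)
n <- 100
N <- 1000
Y_l <- rnorm(n, mean = 2)
f_l <- Y_l + rnorm(n, mean = 0.3, sd = 0.5)
f_u <- rnorm(N, mean = 2) + rnorm(N, mean = 0.3, sd = 0.5)

fit <- ppi_pp_mean_sketch(Y_l, f_l, f_u)             # tuned lhat
fit0 <- ppi_pp_mean_sketch(Y_l, f_l, f_u, lhat = 0)  # classical sample mean
fit1 <- ppi_pp_mean_sketch(Y_l, f_l, f_u, lhat = 1)  # PPI without power tuning
print(c(tuned = fit$est, classical = fit0$est, ppi = fit1$est))
```

Setting lhat = 0 returns the labeled-sample mean and lhat = 1 returns the untuned PPI estimate, so the tuned value interpolates between the two according to how informative the predictions are.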
Helper function for PPI++ OLS estimation
ppi_plusplus_ols( X_l, Y_l, f_l, X_u, f_u, lhat = NULL, coord = NULL, w_l = NULL, w_u = NULL )
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
lhat |
(float, optional): Power-tuning parameter (see
https://arxiv.org/abs/2311.01453). The default value, |
coord |
(int, optional): Coordinate for which to optimize
|
w_l |
(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
w_u |
(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453
(list): A list containing the following:
(vector): vector of PPI++ OLS regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
(float): estimated power-tuning parameter.
(vector): vector of the rectifier OLS regression coefficient estimates.
(matrix): covariance matrix for the gradients in the unlabeled data.
(matrix): covariance matrix for the gradients in the labeled data.
(matrix): matrix of gradients for the labeled data.
(matrix): matrix of predicted gradients for the unlabeled data.
(matrix): matrix of predicted gradients for the labeled data.
(matrix): inverse Hessian matrix.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_ols(X_l, Y_l, f_l, X_u, f_u)
Helper function for PPI++ OLS estimation (point estimate)
ppi_plusplus_ols_est( X_l, Y_l, f_l, X_u, f_u, lhat = NULL, coord = NULL, w_l = NULL, w_u = NULL )
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
lhat |
(float, optional): Power-tuning parameter (see
https://arxiv.org/abs/2311.01453). The default value, |
coord |
(int, optional): Coordinate for which to optimize
|
w_l |
(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
w_u |
(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453
(vector): vector of prediction-powered point estimates of the OLS coefficients.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_ols_est(X_l, Y_l, f_l, X_u, f_u)
Helper function for PPI++ quantile estimation
ppi_plusplus_quantile( Y_l, f_l, f_u, q, alpha = 0.05, exact_grid = FALSE, w_l = NULL, w_u = NULL )
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
q |
(float): Quantile to estimate. Must be in the range (0, 1). |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
exact_grid |
(bool, optional): Whether to compute the exact solution (TRUE) or an approximate solution based on a linearly spaced grid of 5000 values (FALSE). |
w_l |
(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
w_u |
(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453
tuple: Lower and upper bounds of the prediction-powered confidence interval for the quantile.
dat <- simdat(model = "quantile")
form <- Y - f ~ X1
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_quantile(Y_l, f_l, f_u, q = 0.5)
Helper function for PPI++ quantile estimation (point estimate)
ppi_plusplus_quantile_est( Y_l, f_l, f_u, q, exact_grid = FALSE, w_l = NULL, w_u = NULL )
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
q |
(float): Quantile to estimate. Must be in the range (0, 1). |
exact_grid |
(bool, optional): Whether to compute the exact solution (TRUE) or an approximate solution based on a linearly spaced grid of 5000 values (FALSE). |
w_l |
(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
w_u |
(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453
(float): Prediction-powered point estimate of the quantile.
dat <- simdat(model = "quantile")
form <- Y - f ~ 1
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_quantile_est(Y_l, f_l, f_u, q = 0.5)
Helper function for PPI quantile estimation
ppi_quantile(Y_l, f_l, f_u, q, alpha = 0.05, exact_grid = FALSE)
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
q |
(float): Quantile to estimate. Must be in the range (0, 1). |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
exact_grid |
(bool, optional): Whether to compute the exact solution (TRUE) or an approximate solution based on a linearly spaced grid of 5000 values (FALSE). |
Prediction Powered Inference (Angelopoulos et al., 2023) https://www.science.org/doi/10.1126/science.adi6000
tuple: Lower and upper bounds of the prediction-powered confidence interval for the quantile.
dat <- simdat(model = "quantile")
form <- Y - f ~ X1
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_quantile(Y_l, f_l, f_u, q = 0.5)
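The grid-based interval for a quantile can be sketched in base R as follows. This assumes simulated data and a hypothetical helper named ppi_quantile_ci_sketch: for each candidate value on a linearly spaced grid it evaluates the rectified CDF and keeps the candidates whose rectified CDF is statistically indistinguishable from q. It mirrors the approximate (exact_grid = FALSE) behavior only loosely and is not the package's implementation.

```r
# Hypothetical helper, not the package source: grid-based PPI interval for a quantile.
ppi_quantile_ci_sketch <- function(Y_l, f_l, f_u, q, alpha = 0.05, n_grid = 5000) {
  n <- length(Y_l)
  N <- length(f_u)
  grid <- seq(min(Y_l, f_l, f_u), max(Y_l, f_l, f_u), length.out = n_grid)
  z <- qnorm(1 - alpha / 2)
  keep <- vapply(grid, function(theta) {
    imputed <- as.numeric(f_u <= theta)                           # CDF indicators, unlabeled predictions
    rect <- as.numeric(Y_l <= theta) - as.numeric(f_l <= theta)   # indicator differences, labeled data
    cdf_hat <- mean(imputed) + mean(rect)                         # rectified CDF at theta
    se <- sqrt(var(imputed) / N + var(rect) / n)
    abs(cdf_hat - q) <= z * se                                    # theta not rejected as the q-quantile
  }, logical(1))
  range(grid[keep])
}

# Simulated data (assumption): standard normal outcomes, noisy predictions
set.seed(1)
n <- 200
N <- 2000
Y_l <- rnorm(n)
f_l <- Y_l + rnorm(n, sd = 0.3)
f_u <- rnorm(N) + rnorm(N, sd = 0.3)
ci <- ppi_quantile_ci_sketch(Y_l, f_l, f_u, q = 0.5)
print(ci)
```

The interval is the set of grid points that a level-alpha test does not reject, so a finer grid gives endpoints closer to the exact solution at higher cost.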
Prints a brief summary of the IPD method/model combination.
## S3 method for class 'ipd'
print(x, ...)
x |
An object of class ipd. |
... |
Additional arguments to be passed to the print function. |
The input x, invisibly.
#-- Generate Example Data
set.seed(2023)
dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)
head(dat)
formula <- Y - f ~ X1
#-- Fit IPD
fit <- ipd(formula, method = "postpi_analytic", model = "ols", data = dat, label = "set")
#-- Print Output
print(fit)
Prints a detailed summary of the IPD method/model combination.
## S3 method for class 'summary.ipd'
print(x, ...)
x |
An object of class summary.ipd. |
... |
Additional arguments to be passed to the print function. |
The input x, invisibly.
#-- Generate Example Data
set.seed(2023)
dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)
head(dat)
formula <- Y - f ~ X1
#-- Fit IPD
fit <- ipd(formula, method = "postpi_analytic", model = "ols", data = dat, label = "set")
#-- Summarize Output
summ_fit <- summary(fit)
print(summ_fit)
psi function for the estimating equation
psi(X, Y, theta, quant = NA, method)
X |
Array or data.frame containing covariates |
Y |
Array or data.frame of outcomes |
theta |
parameter theta |
quant |
quantile for quantile estimation |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
estimating equation
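To make the role of the estimating equation concrete, the sketch below writes out the standard per-observation score contributions for each listed method. The function name psi_sketch, the sign and scaling conventions, and the simulated data are all assumptions for illustration; the package's psi may be defined differently.

```r
# Hypothetical helper, not the package source: per-observation estimating-equation terms.
psi_sketch <- function(X, Y, theta, quant = NA, method) {
  X <- as.matrix(X)
  Y <- drop(as.matrix(Y))
  switch(method,
    mean     = matrix(Y - theta, ncol = 1),
    quantile = matrix(quant - as.numeric(Y <= theta), ncol = 1),
    ols      = X * drop(Y - X %*% theta),           # row i is x_i * residual_i
    logistic = X * drop(Y - plogis(X %*% theta)),
    poisson  = X * drop(Y - exp(X %*% theta)),
    stop("unknown method: ", method)
  )
}

# Simulated data (assumption)
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))
Y <- drop(X %*% c(1, 2)) + rnorm(n)

# At the OLS solution the estimating equation averages to (numerically) zero
theta_hat <- solve(crossprod(X), crossprod(X, Y))
score_ols <- colMeans(psi_sketch(X, Y, theta_hat, method = "ols"))
score_mean <- mean(psi_sketch(X, Y, mean(Y), method = "mean"))
print(score_ols)
```

An M-estimator is the value of theta at which these contributions average to zero, which is why the column means vanish at the least-squares fit.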
Helper function for PSPA logistic regression
pspa_logistic(X_l, Y_l, f_l, X_u, f_u, weights = NA, alpha = 0.05)
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of binary labeled outcomes. |
f_l |
(vector): n-vector of binary predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of binary predictions in the unlabeled data. |
weights |
(array): p-dimensional vector of weights for variance reduction. PSPA will estimate the weights if not specified. |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
Post-prediction adaptive inference (Miao et al., 2023) https://arxiv.org/abs/2311.14220
A list of outputs: estimate of inference model parameters and corresponding standard error.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pspa_logistic(X_l, Y_l, f_l, X_u, f_u)
Helper function for PSPA mean estimation
pspa_mean(Y_l, f_l, f_u, weights = NA, alpha = 0.05)
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
weights |
(array): 1-dimensional vector of weights for variance reduction. PSPA will estimate the weights if not specified. |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
Post-prediction adaptive inference (Miao et al., 2023) https://arxiv.org/abs/2311.14220
A list of outputs: estimate of inference model parameters and corresponding standard error.
dat <- simdat(model = "mean")
form <- Y - f ~ 1
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pspa_mean(Y_l, f_l, f_u)
Helper function for PSPA OLS for linear regression
pspa_ols(X_l, Y_l, f_l, X_u, f_u, weights = NA, alpha = 0.05)
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
weights |
(array): p-dimensional vector of weights for variance reduction. PSPA will estimate the weights if not specified. |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
Post-prediction adaptive inference (Miao et al., 2023) https://arxiv.org/abs/2311.14220
A list of outputs: estimate of inference model parameters and corresponding standard error.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pspa_ols(X_l, Y_l, f_l, X_u, f_u)
Helper function for PSPA Poisson regression
pspa_poisson(X_l, Y_l, f_l, X_u, f_u, weights = NA, alpha = 0.05)
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled count outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
weights |
(array): p-dimensional vector of weights for variance reduction. PSPA will estimate the weights if not specified. |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
Post-prediction adaptive inference (Miao et al., 2023) https://arxiv.org/abs/2311.14220
A list of outputs: estimate of inference model parameters and corresponding standard error.
dat <- simdat(model = "poisson")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set == "labeled",])
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pspa_poisson(X_l, Y_l, f_l, X_u, f_u)
Helper function for PSPA quantile estimation
pspa_quantile(Y_l, f_l, f_u, q, weights = NA, alpha = 0.05)
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
q |
(float): Quantile to estimate. Must be in the range (0, 1). |
weights |
(array): 1-dimensional vector of weights for variance reduction. PSPA will estimate the weights if not specified. |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
Post-prediction adaptive inference (Miao et al., 2023) https://arxiv.org/abs/2311.14220
A list of outputs: estimate of inference model parameters and corresponding standard error.
dat <- simdat(model = "quantile")
form <- Y - f ~ 1
Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pspa_quantile(Y_l, f_l, f_u, q = 0.5)
pspa_y function conducts post-prediction M-estimation.
pspa_y( X_lab = NA, X_unlab = NA, Y_lab, Yhat_lab, Yhat_unlab, alpha = 0.05, weights = NA, quant = NA, intercept = FALSE, method )
X_lab |
Array or data.frame containing observed covariates in labeled data. |
X_unlab |
Array or data.frame containing observed or predicted covariates in unlabeled data. |
Y_lab |
Array or data.frame of observed outcomes in labeled data. |
Yhat_lab |
Array or data.frame of predicted outcomes in labeled data. |
Yhat_unlab |
Array or data.frame of predicted outcomes in unlabeled data. |
alpha |
Specifies the confidence level as 1 - alpha for confidence intervals. |
weights |
Weights vector for PSPA linear regression (d-dimensional, where d equals the number of covariates). |
quant |
quantile for quantile estimation |
intercept |
Boolean indicating whether the input covariate data already contain an intercept column (TRUE if they do). |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
A summary table presenting point estimates, standard error, confidence intervals (1 - alpha), P-values, and weights.
data <- sim_data_y()
X_lab <- data$X_lab
X_unlab <- data$X_unlab
Y_lab <- data$Y_lab
Yhat_lab <- data$Yhat_lab
Yhat_unlab <- data$Yhat_unlab
pspa_y(X_lab = X_lab, X_unlab = X_unlab, Y_lab = Y_lab, Yhat_lab = Yhat_lab, Yhat_unlab = Yhat_unlab, alpha = 0.05, method = "ols")
Computes the rectified CDF of the data.
rectified_cdf(Y_l, f_l, f_u, grid, w_l = NULL, w_u = NULL)
Y_l |
(vector): Gold-standard labels. |
f_l |
(vector): Predictions corresponding to the gold-standard labels. |
f_u |
(vector): Predictions corresponding to the unlabeled data. |
grid |
(vector): Grid of values to compute the CDF at. |
w_l |
(vector, optional): Sample weights for the labeled data set. |
w_u |
(vector, optional): Sample weights for the unlabeled data set. |
(vector): Rectified CDF of the data at the specified grid points.
Y_l <- c(1, 2, 3, 4, 5)
f_l <- c(1.1, 2.2, 3.3, 4.4, 5.5)
f_u <- c(1.2, 2.3, 3.4)
grid <- seq(0, 6, by = 0.5)
rectified_cdf(Y_l, f_l, f_u, grid)
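For reference, an unweighted rectified CDF can be computed directly from indicator functions. The sketch below reuses the example inputs above with a hypothetical helper named rectified_cdf_sketch; it assumes the rectified CDF is the empirical CDF of the unlabeled predictions plus the average difference between labeled-outcome and labeled-prediction indicators, and it ignores sample weights.

```r
# Hypothetical helper, not the package source: unweighted rectified CDF on a grid.
rectified_cdf_sketch <- function(Y_l, f_l, f_u, grid) {
  vapply(grid, function(t) {
    mean(f_u <= t) + mean((Y_l <= t) - (f_l <= t))
  }, numeric(1))
}

Y_l <- c(1, 2, 3, 4, 5)
f_l <- c(1.1, 2.2, 3.3, 4.4, 5.5)
f_u <- c(1.2, 2.3, 3.4)
grid <- seq(0, 6, by = 0.5)
cdf_hat <- rectified_cdf_sketch(Y_l, f_l, f_u, grid)
print(round(cdf_hat, 3))
```

Because the correction term is a difference of two empirical CDFs, the result is not guaranteed to be monotone or to stay within [0, 1] in general.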
Computes a rectified p-value.
rectified_p_value(rectifier, rectifier_std, imputed_mean, imputed_std, null = 0, alternative = "two-sided")
rectifier |
(float or vector): Rectifier value. |
rectifier_std |
(float or vector): Rectifier standard deviation. |
imputed_mean |
(float or vector): Imputed mean. |
imputed_std |
(float or vector): Imputed standard deviation. |
null |
(float, optional): Value of the null hypothesis to be tested. Defaults to 0. |
alternative |
(str, optional): Alternative hypothesis, either 'two-sided', 'larger' or 'smaller'. |
(float or vector): The rectified p-value.
rectifier <- 0.7
rectifier_std <- 0.5
imputed_mean <- 1.5
imputed_std <- 0.3
rectified_p_value(rectifier, rectifier_std, imputed_mean, imputed_std)
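One plausible reading of the computation, sketched in base R: the rectified point estimate is the imputed mean plus the rectifier, its standard error combines the two standard deviations in quadrature (which assumes the two pieces are independent), and the p-value comes from a normal test against the null value. The function `rectified_p_value_sketch` is hypothetical and only mirrors the documented arguments; it is not the package's code.

```r
# Hypothetical sketch, assuming independent rectifier and imputed mean.
rectified_p_value_sketch <- function(rectifier, rectifier_std,
                                     imputed_mean, imputed_std,
                                     null = 0, alternative = "two-sided") {
  estimate <- imputed_mean + rectifier              # rectified point estimate
  se <- sqrt(imputed_std^2 + rectifier_std^2)       # combined standard error
  z <- (estimate - null) / se
  switch(alternative,
    "two-sided" = 2 * pnorm(-abs(z)),
    "larger"    = pnorm(z, lower.tail = FALSE),
    "smaller"   = pnorm(z),
    stop("unknown alternative: ", alternative)
  )
}

p <- rectified_p_value_sketch(0.7, 0.5, 1.5, 0.3)
p
```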
Sigma_cal function for variance-covariance matrix of the estimation equation
Sigma_cal(X_lab, X_unlab, Y_lab, Yhat_lab, Yhat_unlab, w, theta, quant = NA, method)
X_lab |
Array or data.frame containing observed covariates in labeled data. |
X_unlab |
Array or data.frame containing observed or predicted covariates in unlabeled data. |
Y_lab |
Array or data.frame of observed outcomes in labeled data. |
Yhat_lab |
Array or data.frame of predicted outcomes in labeled data. |
Yhat_unlab |
Array or data.frame of predicted outcomes in unlabeled data. |
w |
Weight vector for PSPA linear regression (d-dimensional, where d equals the number of covariates). |
theta |
parameter theta |
quant |
quantile for quantile estimation |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
variance-covariance matrix of the estimation equation
sim_data_y for simulation with ML-predicted Y
sim_data_y(r = 0.9, binary = FALSE)
r |
imputation correlation |
binary |
simulate binary outcome or not |
simulated data
Data generation function for various underlying models
simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1, model = "ols", shift = 0, scale = 1)
n |
Integer vector of size 3 indicating the sample sizes in the training, labeled, and unlabeled data sets, respectively |
effect |
Regression coefficient for the first variable of interest for inference. Default is 1. |
sigma_Y |
Residual variance for the generated outcome. Default is 1. |
model |
The type of model to be generated. Must be one of
|
shift |
Scalar shift of the predictions for continuous outcomes (i.e., "mean", "quantile", and "ols"). Defaults to 0. |
scale |
Scaling factor for the predictions for continuous outcomes (i.e., "mean", "quantile", and "ols"). Defaults to 1. |
A data.frame containing sum(n) rows (one per simulated observation) and columns corresponding to the labeled outcome (Y), the predicted outcome (f), a character variable (set) indicating which data set the observation belongs to (training, labeled, or unlabeled), and four independent, normally distributed predictors (X1, X2, X3, and X4), where applicable.
#-- Mean
dat_mean <- simdat(c(100, 100, 100), effect = 1, sigma_Y = 1, model = "mean")
head(dat_mean)
#-- Linear Regression
dat_ols <- simdat(c(100, 100, 100), effect = 1, sigma_Y = 1, model = "ols")
head(dat_ols)
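For readers who want to see how a data set with this layout could arise, the following base-R sketch builds a stacked training/labeled/unlabeled data frame and attaches predictions from a model fit on the training rows only. Everything here is an assumption made for illustration: `simdat_sketch`, the outcome model, and the use of a linear predictor are hypothetical, and the package's simdat() may generate its data quite differently.

```r
# Hypothetical sketch; simdat() in the package may use a different mechanism.
simdat_sketch <- function(n = c(300, 300, 300), effect = 1, sigma_Y = 1) {
  N <- sum(n)
  # Four independent standard normal predictors
  X <- matrix(rnorm(N * 4), ncol = 4, dimnames = list(NULL, paste0("X", 1:4)))
  set <- rep(c("training", "labeled", "unlabeled"), times = n)
  # Assumed outcome model; sigma_Y is treated as a variance, per the description
  Y <- effect * X[, 1] + 0.5 * X[, 2]^2 + rnorm(N, sd = sqrt(sigma_Y))
  dat <- data.frame(X, Y = Y, set = set)
  # The "algorithm": a deliberately imperfect linear model fit on training rows
  fit <- lm(Y ~ X1 + X2 + X3 + X4, data = dat[dat$set == "training", ])
  dat$f <- unname(predict(fit, newdata = dat))
  dat
}

set.seed(2023)
dat_sketch <- simdat_sketch(c(100, 100, 100))
table(dat_sketch$set)
head(dat_sketch)
```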
Produces a summary of the IPD method/model combination.
## S3 method for class 'ipd'
summary(object, ...)
object |
An object of class ipd. |
... |
Additional arguments to be passed to the summary function. |
A list containing:
Model coefficients and related statistics.
Performance metrics of the model fit.
Additional summary information.
#-- Generate Example Data
set.seed(2023)
dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)
head(dat)
formula <- Y - f ~ X1
#-- Fit IPD
fit <- ipd(formula, method = "postpi_analytic", model = "ols", data = dat, label = "set")
#-- Summarize Output
summ_fit <- summary(fit)
summ_fit
Tidies the IPD method/model fit into a data frame.
## S3 method for class 'ipd'
tidy(x, ...)
x |
An object of class ipd. |
... |
Additional arguments to be passed to the tidy function. |
A tidy data frame of the model's coefficients.
#-- Generate Example Data
set.seed(2023)
dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)
head(dat)
formula <- Y - f ~ X1
#-- Fit IPD
fit <- ipd(formula, method = "postpi_analytic", model = "ols", data = dat, label = "set")
#-- Tidy Output
tidy(fit)
Computes the weighted least squares estimate of the coefficients.
wls(X, Y, w = NULL, return_se = FALSE)
X |
(matrix): n x p matrix of covariates. |
Y |
(vector): n-vector of outcome values. |
w |
(vector, optional): n-vector of sample weights. |
return_se |
(bool, optional): Whether to return the standard errors of the coefficients. |
(list): A list containing the following:
(vector): p-vector of weighted least squares estimates of the coefficients.
(vector): If return_se == TRUE, return the p-vector of standard errors of the coefficients.
n <- 1000
X <- rnorm(n, 1, 1)
w <- rep(1, n)
Y <- X + rnorm(n, 0, 1)
wls(X, Y, w = w, return_se = TRUE)
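The weighted least squares estimate itself can be written in a few lines of base R by solving the weighted normal equations. The helper `wls_sketch` below is hypothetical and covers only the coefficients (no standard errors); it also assumes that X is used exactly as supplied, so an intercept must be included as a column of ones if one is wanted. The result is compared against `lm()` with a `weights` argument as a sanity check.

```r
# Hypothetical sketch of the coefficient computation only.
wls_sketch <- function(X, Y, w = NULL) {
  X <- as.matrix(X)
  if (is.null(w)) w <- rep(1, nrow(X))
  # w * X scales row i of X by w[i], so crossprod(X, w * X) is X' W X
  drop(solve(crossprod(X, w * X), crossprod(X, w * Y)))
}

set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n, 1, 1))      # intercept column plus one covariate
w <- runif(n, 0.5, 2)
Y <- X[, 2] + rnorm(n, 0, 1)

beta_hat <- wls_sketch(X, Y, w)
beta_lm <- unname(coef(lm(Y ~ X - 1, weights = w)))
rbind(beta_hat, beta_lm)
```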
Calculates normal confidence intervals for a given alternative at a given significance level.
zconfint_generic(mean, std_mean, alpha, alternative)
mean |
(float): Estimated normal mean. |
std_mean |
(float): Estimated standard error of the mean. |
alpha |
(float): Significance level in [0,1] |
alternative |
(string): Alternative hypothesis, either 'two-sided', 'larger' or 'smaller'. |
(vector): Lower and upper (1 - alpha) * 100% confidence limits.
n <- 1000
Y <- rnorm(n, 1, 1)
se_Y <- sd(Y) / sqrt(n)
zconfint_generic(mean(Y), se_Y, alpha = 0.05, alternative = "two-sided")
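The normal-theory interval behind this function can be sketched directly with `qnorm()`. The helper `zconfint_sketch` is hypothetical, and the one-sided conventions are an assumption: "larger" is taken to yield a lower bound only and "smaller" an upper bound only, which may not match the package's behavior exactly.

```r
# Hypothetical sketch; one-sided conventions are assumed, not confirmed.
zconfint_sketch <- function(mean, std_mean, alpha = 0.05,
                            alternative = "two-sided") {
  switch(alternative,
    "two-sided" = mean + c(-1, 1) * qnorm(1 - alpha / 2) * std_mean,
    "larger"    = c(mean - qnorm(1 - alpha) * std_mean, Inf),
    "smaller"   = c(-Inf, mean + qnorm(1 - alpha) * std_mean),
    stop("unknown alternative: ", alternative)
  )
}

ci_two <- zconfint_sketch(1, 0.1, alpha = 0.05)
ci_larger <- zconfint_sketch(1, 0.1, alpha = 0.05, alternative = "larger")
ci_two
ci_larger
```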
Computes the z-statistic and the corresponding p-value for a given test.
zstat_generic(value1, value2, std_diff, alternative, diff = 0)
value1 |
(numeric): The first value or sample mean. |
value2 |
(numeric): The second value or sample mean. |
std_diff |
(numeric): The standard error of the difference between the two values. |
alternative |
(character): The alternative hypothesis. Can be one of "two-sided" (or "2-sided", "2s"), "larger" (or "l"), or "smaller" (or "s"). |
diff |
(numeric, optional): The hypothesized difference between the two values. Default is 0. |
(list): A list containing the following:
(numeric): The computed z-statistic.
(numeric): The corresponding p-value for the test.
value1 <- 1.5
value2 <- 1.0
std_diff <- 0.2
alternative <- "two-sided"
result <- zstat_generic(value1, value2, std_diff, alternative)
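The arithmetic described above is small enough to spell out. The base-R sketch below standardizes the difference between the two values, after subtracting the hypothesized difference, and converts it to a p-value under a standard normal reference distribution. The function `zstat_sketch` and the list element names `zstat` and `pvalue` are hypothetical choices for this illustration.

```r
# Hypothetical sketch; element names of the returned list are assumptions.
zstat_sketch <- function(value1, value2, std_diff,
                         alternative = "two-sided", diff = 0) {
  zstat <- (value1 - value2 - diff) / std_diff
  pvalue <- if (alternative %in% c("two-sided", "2-sided", "2s")) {
    2 * pnorm(-abs(zstat))               # both tails
  } else if (alternative %in% c("larger", "l")) {
    pnorm(zstat, lower.tail = FALSE)     # upper tail
  } else if (alternative %in% c("smaller", "s")) {
    pnorm(zstat)                         # lower tail
  } else {
    stop("invalid alternative: ", alternative)
  }
  list(zstat = zstat, pvalue = pvalue)
}

res <- zstat_sketch(1.5, 1.0, 0.2, "two-sided")
res$zstat
res$pvalue
```

For the values in the example, the standardized difference is (1.5 - 1.0) / 0.2 = 2.5.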