Package 'ipd'

Title: Inference on Predicted Data
Description: Performs valid statistical inference on predicted data (IPD) using recent methods, where for a subset of the data, the outcomes have been predicted by an algorithm. Provides a wrapper function with specified defaults for the type of model and method to be used for estimation and inference. Further provides methods for tidying and summarizing results. Salerno et al., (2024) <doi:10.48550/arXiv.2410.09665>.
Authors: Stephen Salerno [aut, cre, cph] , Jiacheng Miao [aut], Awan Afiaz [aut], Kentaro Hoffman [aut], Anna Neufeld [aut], Qiongshi Lu [aut], Tyler H McCormick [aut], Jeffrey T Leek [aut]
Maintainer: Stephen Salerno <[email protected]>
License: MIT + file LICENSE
Version: 0.1.2
Built: 2024-10-25 05:31:30 UTC
Source: CRAN

Help Index


Calculation of the matrix A based on single dataset

Description

A function for the calculation of the matrix A based on single dataset

Usage

A(X, Y, quant = NA, theta, method)

Arguments

X

Array or data.frame containing covariates

Y

Array or data.frame of outcomes

quant

quantile for quantile estimation

theta

parameter theta

method

indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson".

Value

matrix A based on single dataset


Augment Data from an IPD Fit

Description

Augments the data used for an IPD method/model fit with additional information about each observation.

Usage

## S3 method for class 'ipd'
augment(x, data = x$data_u, ...)

Arguments

x

An object of class ipd.

data

The data.frame used to fit the model. Default is x$data.

...

Additional arguments to be passed to the augment function.

Value

A data.frame containing the original data along with fitted values and residuals.

Examples

#-- Generate Example Data

set.seed(2023)

dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)

head(dat)

formula <- Y - f ~ X1

#-- Fit IPD

fit <- ipd(formula, method = "postpi_analytic", model = "ols",

  data = dat, label = "set")

#-- Augment Data

augmented_df <- augment(fit)

head(augmented_df)

Estimate PPI++ Power Tuning Parameter

Description

Calculates the optimal value of lhat for the prediction-powered confidence interval for GLMs.

Usage

calc_lhat_glm(
  grads,
  grads_hat,
  grads_hat_unlabeled,
  inv_hessian,
  coord = NULL,
  clip = FALSE
)

Arguments

grads

(matrix): n x p matrix gradient of the loss function with respect to the parameter evaluated at the labeled data.

grads_hat

(matrix): n x p matrix gradient of the loss function with respect to the model parameter evaluated using predictions on the labeled data.

grads_hat_unlabeled

(matrix): N x p matrix gradient of the loss function with respect to the parameter evaluated using predictions on the unlabeled data.

inv_hessian

(matrix): p x p matrix inverse of the Hessian of the loss function with respect to the parameter.

coord

(int, optional): Coordinate for which to optimize lhat. If None, it optimizes the total variance over all coordinates. Must be in (1, ..., d) where d is the shape of the estimand.

clip

(boolean, optional): Whether to clip the value of lhat to be non-negative. Defaults to False.

Value

(float): Optimal value of lhat in [0,1].

Examples

dat <- simdat(model = "ols")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

est <- ppi_plusplus_ols_est(X_l, Y_l, f_l, X_u, f_u)

stats <- ols_get_stats(est, X_l, Y_l, f_l, X_u, f_u)

calc_lhat_glm(stats$grads, stats$grads_hat, stats$grads_hat_unlabeled,

  stats$inv_hessian, coord = NULL, clip = FALSE)

Empirical CDF of the Data

Description

Computes the empirical CDF of the data.

Usage

compute_cdf(Y, grid, w = NULL)

Arguments

Y

(matrix): n x 1 matrix of observed data.

grid

(matrix): Grid of values to compute the CDF at.

w

(vector, optional): n-vector of sample weights.

Value

(list): Empirical CDF and its standard deviation at the specified grid points.

Examples

Y <- c(1, 2, 3, 4, 5)

grid <- seq(0, 6, by = 0.5)

compute_cdf(Y, grid)

Empirical CDF Difference

Description

Computes the difference between the empirical CDFs of the data and the predictions.

Usage

compute_cdf_diff(Y, f, grid, w = NULL)

Arguments

Y

(matrix): n x 1 matrix of observed data.

f

(matrix): n x 1 matrix of predictions.

grid

(matrix): Grid of values to compute the CDF at.

w

(vector, optional): n-vector of sample weights.

Value

(list): Difference between the empirical CDFs of the data and the predictions and its standard deviation at the specified grid points.

Examples

Y <- c(1, 2, 3, 4, 5)

f <- c(1.1, 2.2, 3.3, 4.4, 5.5)

grid <- seq(0, 6, by = 0.5)

compute_cdf_diff(Y, f, grid)

Initial estimation

Description

est_ini function for initial estimation

Usage

est_ini(X, Y, quant = NA, method)

Arguments

X

Array or data.frame containing covariates

Y

Array or data.frame of outcomes

quant

quantile for quantile estimation

method

indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson".

Value

initial estimatior


Glance at an IPD Fit

Description

Glances at the IPD method/model fit, returning a one-row summary.

Usage

## S3 method for class 'ipd'
glance(x, ...)

Arguments

x

An object of class ipd.

...

Additional arguments to be passed to the glance function.

Value

A one-row data frame summarizing the IPD method/model fit.

Examples

#-- Generate Example Data

set.seed(2023)

dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)

head(dat)

formula <- Y - f ~ X1

#-- Fit IPD

fit <- ipd(formula, method = "postpi_analytic", model = "ols",

  data = dat, label = "set")

#-- Glance Output

glance(fit)

Inference on Predicted Data (ipd)

Description

The main wrapper function to conduct ipd using various methods and models, and returns a list of fitted model components.

Usage

ipd(
  formula,
  method,
  model,
  data,
  label = NULL,
  unlabeled_data = NULL,
  seed = NULL,
  intercept = TRUE,
  alpha = 0.05,
  alternative = "two-sided",
  ...
)

Arguments

formula

An object of class formula: a symbolic description of the model to be fitted. Must be of the form Y - f ~ X, where Y is the name of the column corresponding to the observed outcome in the labeled data, f is the name of the column corresponding to the predicted outcome in both labeled and unlabeled data, and X corresponds to the features of interest (i.e., X = X1 + ... + Xp).

method

The method to be used for fitting the model. Must be one of "postpi_analytic", "postpi_boot", "ppi", "pspa", or "ppi_plusplus".

model

The type of model to be fitted. Must be one of "mean", "quantile", "ols", or "logistic".

data

A data.frame containing the variables in the model, either a stacked data frame with a specific column identifying the labeled versus unlabeled observations (label), or only the labeled data set. Must contain columns for the observed outcomes (Y), the predicted outcomes (f), and the features (X) needed to specify the formula.

label

A string, int, or logical specifying the column in the data that distinguishes between the labeled and unlabeled observations. See the Details section for more information. If NULL, unlabeled_data must be specified.

unlabeled_data

(optional) A data.frame of unlabeled data. If NULL, label must be specified. Specifying both the label and unlabeled_data arguments will result in an error message. If specified, must contain columns for the predicted outcomes (f), and the features (X) needed to specify the formula.

seed

(optional) An integer seed for random number generation.

intercept

Logical. Should an intercept be included in the model? Default is TRUE.

alpha

The significance level for confidence intervals. Default is 0.05.

alternative

A string specifying the alternative hypothesis. Must be one of "two-sided", "less", or "greater".

...

Additional arguments to be passed to the fitting function. See the Details section for more information.

Details

1. Formula:

The ipd function uses one formula argument that specifies both the calibrating model (e.g., PostPI "relationship model", PPI "rectifier" model) and the inferential model. These separate models will be created internally based on the specific method called.

2. Data:

The data can be specified in two ways:

  1. Single data argument (data) containing a stacked data.frame and a label identifier (label).

  2. Two data arguments, one for the labeled data (data) and one for the unlabeled data (unlabeled_data).

For option (1), provide one data argument (data) which contains a stacked data.frame with both the unlabeled and labeled data and a label argument that specify the column that identifies the labeled versus the unlabeled observations in the stacked data.frame

NOTE: Labeled data identifiers can be:

String

"l", "lab", "label", "labeled", "labelled", "tst", "test", "true"

Logical

TRUE

Factor

Non-reference category (i.e., binary 1)

Unlabeled data identifiers can be:

String

"u", "unlab", "unlabeled", "unlabelled", "val", "validation", "false"

Logical

FALSE

Factor

Non-reference category (i.e., binary 0)

For option (2), provide separate data arguments for the labeled data set (data) and the unlabeled data set (unlabeled_data). If the second argument is provided, the function ignores the label identifier and assumes the data provided is stacked.

3. Method:

Use the method argument to specify the fitting method:

"postpi"

Wang et al. (2020) Post-Prediction Inference (PostPI)

"ppi"

Angelopoulos et al. (2023) Prediction-Powered Inference (PPI)

"ppi_plusplus"

Angelopoulos et al. (2023) PPI++

"pspa"

Miao et al. (2023) Assumption-Lean and Data-Adaptive Post-Prediction Inference (PSPA)

4. Model:

Use the model argument to specify the type of model:

"mean"

Mean value of the outcome

"quantile"

qth quantile of the outcome

"ols"

Linear regression

"logistic"

Logistic regression

"poisson"

Poisson regression

The ipd wrapper function will concatenate the method and model arguments to identify the required helper function, following the naming convention "method_model".

5. Auxiliary Arguments:

The wrapper function will take method-specific auxiliary arguments (e.g., q for the quantile estimation models) and pass them to the helper function through the "..." with specified defaults for simplicity.

6. Other Arguments:

All other arguments that relate to all methods (e.g., alpha, ci.type), or other method-specific arguments, will have defaults.

Value

a summary of model output.

A list containing the fitted model components:

coefficients

Estimated coefficients of the model

se

Standard errors of the estimated coefficients

ci

Confidence intervals for the estimated coefficients

formula

The formula used to fit the ipd model.

data

The data frame used for model fitting.

method

The method used for model fitting.

model

The type of model fitted.

intercept

Logical. Indicates if an intercept was included in the model.

fit

Fitted model object containing estimated coefficients, standard errors, confidence intervals, and additional method-specific output.

...

Additional output specific to the method used.

Examples

#-- Generate Example Data

set.seed(12345)

dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)

head(dat)

formula <- Y - f ~ X1

#-- PostPI Analytic Correction (Wang et al., 2020)

ipd(formula, method = "postpi_analytic", model = "ols",

    data = dat, label = "set")

#-- PostPI Bootstrap Correction (Wang et al., 2020)

nboot <- 200

ipd(formula, method = "postpi_boot", model = "ols",

    data = dat, label = "set", nboot = nboot)

#-- PPI (Angelopoulos et al., 2023)

ipd(formula, method = "ppi", model = "ols",

    data = dat, label = "set")

#-- PPI++ (Angelopoulos et al., 2023)

ipd(formula, method = "ppi_plusplus", model = "ols",

    data = dat, label = "set")

#-- PSPA (Miao et al., 2023)

ipd(formula, method = "pspa", model = "ols",

    data = dat, label = "set")

Log1p Exponential

Description

Computes the natural logarithm of 1 plus the exponential of the input, to handle large inputs.

Usage

log1pexp(x)

Arguments

x

(vector): A numeric vector of inputs.

Value

(vector): A numeric vector where each element is the result of log(1 + exp(x)).

Examples

x <- c(-1, 0, 1, 10, 100)

log1pexp(x)

Logistic Regression Gradient and Hessian

Description

Computes the statistics needed for the logstic regression-based prediction-powered inference.

Usage

logistic_get_stats(
  est,
  X_l,
  Y_l,
  f_l,
  X_u,
  f_u,
  w_l = NULL,
  w_u = NULL,
  use_u = TRUE
)

Arguments

est

(vector): Point estimates of the coefficients.

X_l

(matrix): Covariates for the labeled data set.

Y_l

(vector): Labels for the labeled data set.

f_l

(vector): Predictions for the labeled data set.

X_u

(matrix): Covariates for the unlabeled data set.

f_u

(vector): Predictions for the unlabeled data set.

w_l

(vector, optional): Sample weights for the labeled data set.

w_u

(vector, optional): Sample weights for the unlabeled data set.

use_u

(bool, optional): Whether to use the unlabeled data set.

Value

(list): A list containing the following:

grads

(matrix): n x p matrix gradient of the loss function with respect to the coefficients.

grads_hat

(matrix): n x p matrix gradient of the loss function with respect to the coefficients, evaluated using the labeled predictions.

grads_hat_unlabeled

(matrix): N x p matrix gradient of the loss function with respect to the coefficients, evaluated using the unlabeled predictions.

inv_hessian

(matrix): p x p matrix inverse Hessian of the loss function with respect to the coefficients.

Examples

dat <- simdat(model = "logistic")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

est <- ppi_plusplus_logistic_est(X_l, Y_l, f_l, X_u, f_u)

stats <- logistic_get_stats(est, X_l, Y_l, f_l, X_u, f_u)

Sample expectation of psi

Description

mean_psi function for sample expectation of psi

Usage

mean_psi(X, Y, theta, quant = NA, method)

Arguments

X

Array or data.frame containing covariates

Y

Array or data.frame of outcomes

theta

parameter theta

quant

quantile for quantile estimation

method

indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson".

Value

sample expectation of psi


Sample expectation of PSPA psi

Description

mean_psi_pop function for sample expectation of PSPA psi

Usage

mean_psi_pop(
  X_lab,
  X_unlab,
  Y_lab,
  Yhat_lab,
  Yhat_unlab,
  w,
  theta,
  quant = NA,
  method
)

Arguments

X_lab

Array or data.frame containing observed covariates in labeled data.

X_unlab

Array or data.frame containing observed or predicted covariates in unlabeled data.

Y_lab

Array or data.frame of observed outcomes in labeled data.

Yhat_lab

Array or data.frame of predicted outcomes in labeled data.

Yhat_unlab

Array or data.frame of predicted outcomes in unlabeled data.

w

weights vector PSPA linear regression (d-dimensional, where d equals the number of covariates).

theta

parameter theta

quant

quantile for quantile estimation

method

indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson".

Value

sample expectation of PSPA psi


Ordinary Least Squares

Description

Computes the ordinary least squares coefficients.

Usage

ols(X, Y, return_se = FALSE)

Arguments

X

(matrix): n x p matrix of covariates.

Y

(vector): p-vector of outcome values.

return_se

(bool, optional): Whether to return the standard errors of the coefficients.

Value

(list): A list containing the following:

theta

(vector): p-vector of ordinary least squares estimates of the coefficients.

se

(vector): If return_se == TRUE, return the p-vector of standard errors of the coefficients.

Examples

n <- 1000

X <- rnorm(n, 1, 1)

Y <- X + rnorm(n, 0, 1)

ols(X, Y, return_se = TRUE)

OLS Gradient and Hessian

Description

Computes the statistics needed for the OLS-based prediction-powered inference.

Usage

ols_get_stats(
  est,
  X_l,
  Y_l,
  f_l,
  X_u,
  f_u,
  w_l = NULL,
  w_u = NULL,
  use_u = TRUE
)

Arguments

est

(vector): Point estimates of the coefficients.

X_l

(matrix): Covariates for the labeled data set.

Y_l

(vector): Labels for the labeled data set.

f_l

(vector): Predictions for the labeled data set.

X_u

(matrix): Covariates for the unlabeled data set.

f_u

(vector): Predictions for the unlabeled data set.

w_l

(vector, optional): Sample weights for the labeled data set.

w_u

(vector, optional): Sample weights for the unlabeled data set.

use_u

(boolean, optional): Whether to use the unlabeled data set.

Value

(list): A list containing the following:

grads

(matrix): n x p matrix gradient of the loss function with respect to the coefficients.

grads_hat

(matrix): n x p matrix gradient of the loss function with respect to the coefficients, evaluated using the labeled predictions.

grads_hat_unlabeled

(matrix): N x p matrix gradient of the loss function with respect to the coefficients, evaluated using the unlabeled predictions.

inv_hessian

(matrix): p x p matrix inverse Hessian of the loss function with respect to the coefficients.

Examples

dat <- simdat(model = "ols")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

est <- ppi_plusplus_ols_est(X_l, Y_l, f_l, X_u, f_u)

stats <- ols_get_stats(est, X_l, Y_l, f_l, X_u, f_u, use_u = TRUE)

One-step update for obtaining estimator

Description

optim_est function for One-step update for obtaining estimator

Usage

optim_est(
  X_lab,
  X_unlab,
  Y_lab,
  Yhat_lab,
  Yhat_unlab,
  w,
  theta,
  quant = NA,
  method
)

Arguments

X_lab

Array or data.frame containing observed covariates in labeled data.

X_unlab

Array or data.frame containing observed or predicted covariates in unlabeled data.

Y_lab

Array or data.frame of observed outcomes in labeled data.

Yhat_lab

Array or data.frame of predicted outcomes in labeled data.

Yhat_unlab

Array or data.frame of predicted outcomes in unlabeled data.

w

weights vector PSPA linear regression (d-dimensional, where d equals the number of covariates).

theta

parameter theta

quant

quantile for quantile estimation

method

indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson".

Value

estimator


One-step update for obtaining the weight vector

Description

optim_weights function for One-step update for obtaining estimator

Usage

optim_weights(
  X_lab,
  X_unlab,
  Y_lab,
  Yhat_lab,
  Yhat_unlab,
  w,
  theta,
  quant = NA,
  method
)

Arguments

X_lab

Array or data.frame containing observed covariates in labeled data.

X_unlab

Array or data.frame containing observed or predicted covariates in unlabeled data.

Y_lab

Array or data.frame of observed outcomes in labeled data.

Yhat_lab

Array or data.frame of predicted outcomes in labeled data.

Yhat_unlab

Array or data.frame of predicted outcomes in unlabeled data.

w

weights vector PSPA linear regression (d-dimensional, where d equals the number of covariates).

theta

parameter theta

quant

quantile for quantile estimation

method

indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson".

Value

weights


PostPI OLS (Analytic Correction)

Description

Helper function for PostPI OLS estimation (analytic correction)

Usage

postpi_analytic_ols(X_l, Y_l, f_l, X_u, f_u, scale_se = TRUE, n_t = Inf)

Arguments

X_l

(matrix): n x p matrix of covariates in the labeled data.

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

X_u

(matrix): N x p matrix of covariates in the unlabeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

scale_se

(boolean): Logical argument to scale relationship model error variance. Defaults to TRUE; FALSE option is retained for posterity.

n_t

(integer, optional) Size of the dataset used to train the prediction function (necessary if n_t < nrow(X_l). Defaults to Inf.

Details

Methods for correcting inference based on outcomes predicted by machine learning (Wang et al., 2020) https://www.pnas.org/doi/abs/10.1073/pnas.2001238117

Value

A list of outputs: estimate of the inference model parameters and corresponding standard error estimate.

Examples

dat <- simdat(model = "ols")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

postpi_analytic_ols(X_l, Y_l, f_l, X_u, f_u)

PostPI Logistic Regression (Bootstrap Correction)

Description

Helper function for PostPI logistic regression (bootstrap correction)

Usage

postpi_boot_logistic(
  X_l,
  Y_l,
  f_l,
  X_u,
  f_u,
  nboot = 100,
  se_type = "par",
  seed = NULL
)

Arguments

X_l

(matrix): n x p matrix of covariates in the labeled data.

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

X_u

(matrix): N x p matrix of covariates in the unlabeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

nboot

(integer): Number of bootstrap samples. Defaults to 100.

se_type

(string): Which method to calculate the standard errors. Options include "par" (parametric) or "npar" (nonparametric). Defaults to "par".

seed

(optional) An integer seed for random number generation.

Details

Methods for correcting inference based on outcomes predicted by machine learning (Wang et al., 2020) https://www.pnas.org/doi/abs/10.1073/pnas.2001238117

Value

A list of outputs: estimate of inference model parameters and corresponding standard error based on both parametric and non-parametric bootstrap methods.

Examples

dat <- simdat(model = "logistic")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

postpi_boot_logistic(X_l, Y_l, f_l, X_u, f_u, nboot = 200)

PostPI OLS (Bootstrap Correction)

Description

Helper function for PostPI OLS estimation (bootstrap correction)

Usage

postpi_boot_ols(
  X_l,
  Y_l,
  f_l,
  X_u,
  f_u,
  nboot = 100,
  se_type = "par",
  rel_func = "lm",
  scale_se = TRUE,
  n_t = Inf,
  seed = NULL
)

Arguments

X_l

(matrix): n x p matrix of covariates in the labeled data.

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

X_u

(matrix): N x p matrix of covariates in the unlabeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

nboot

(integer): Number of bootstrap samples. Defaults to 100.

se_type

(string): Which method to calculate the standard errors. Options include "par" (parametric) or "npar" (nonparametric). Defaults to "par".

rel_func

(string): Method for fitting the relationship model. Options include "lm" (linear model), "rf" (random forest), and "gam" (generalized additive model). Defaults to "lm".

scale_se

(boolean): Logical argument to scale relationship model error variance. Defaults to TRUE; FALSE option is retained for posterity.

n_t

(integer, optional) Size of the dataset used to train the prediction function (necessary if n_t < nrow(X_l). Defaults to Inf.

seed

(optional) An integer seed for random number generation.

Details

Methods for correcting inference based on outcomes predicted by machine learning (Wang et al., 2020) https://www.pnas.org/doi/abs/10.1073/pnas.2001238117

Value

A list of outputs: estimate of inference model parameters and corresponding standard error based on both parametric and non-parametric bootstrap methods.

Examples

dat <- simdat(model = "ols")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

postpi_boot_ols(X_l, Y_l, f_l, X_u, f_u, nboot = 200)

PPI Logistic Regression

Description

Helper function for PPI logistic regression

Usage

ppi_logistic(X_l, Y_l, f_l, X_u, f_u, opts = NULL)

Arguments

X_l

(matrix): n x p matrix of covariates in the labeled data.

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

X_u

(matrix): N x p matrix of covariates in the unlabeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

opts

(list, optional): Options to pass to the optimizer. See ?optim for details.

Details

Prediction Powered Inference (Angelopoulos et al., 2023) https://www.science.org/doi/10.1126/science.adi6000

Value

(list): A list containing the following:

est

(vector): vector of PPI logistic regression coefficient estimates.

se

(vector): vector of standard errors of the coefficients.

rectifier_est

(vector): vector of the rectifier logistic regression coefficient estimates.

var_u

(matrix): covariance matrix for the gradients in the unlabeled data.

var_l

(matrix): covariance matrix for the gradients in the labeled data.

grads

(matrix): matrix of gradients for the labeled data.

grads_hat_unlabeled

(matrix): matrix of predicted gradients for the unlabeled data.

grads_hat

(matrix): matrix of predicted gradients for the labeled data.

inv_hessian

(matrix): inverse Hessian matrix.

Examples

dat <- simdat(model = "logistic")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

ppi_logistic(X_l, Y_l, f_l, X_u, f_u)

PPI Mean Estimation

Description

Helper function for PPI mean estimation

Usage

ppi_mean(Y_l, f_l, f_u, alpha = 0.05, alternative = "two-sided")

Arguments

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

alpha

(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05.

alternative

(string): Alternative hypothesis. Must be one of "two-sided", "less", or "greater".

Details

Prediction Powered Inference (Angelopoulos et al., 2023) https://www.science.org/doi/10.1126/science.adi6000

Value

tuple: Lower and upper bounds of the prediction-powered confidence interval for the mean.

Examples

dat <- simdat(model = "mean")

form <- Y - f ~ 1

Y_l <- dat[dat$set == "labeled",   all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled",   all.vars(form)[2]] |> matrix(ncol = 1)

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

ppi_mean(Y_l, f_l, f_u)

PPI OLS

Description

Helper function for prediction-powered inference for OLS estimation

Usage

ppi_ols(X_l, Y_l, f_l, X_u, f_u, w_l = NULL, w_u = NULL)

Arguments

X_l

(matrix): n x p matrix of covariates in the labeled data.

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

X_u

(matrix): N x p matrix of covariates in the unlabeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

w_l

(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones.

w_u

(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones.

Details

Prediction Powered Inference (Angelopoulos et al., 2023) https://www.science.org/doi/10.1126/science.adi6000

Value

(list): A list containing the following:

est

(vector): vector of PPI OLS regression coefficient estimates.

se

(vector): vector of standard errors of the coefficients.

rectifier_est

(vector): vector of the rectifier OLS regression coefficient estimates.

Examples

dat <- simdat()

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

ppi_ols(X_l, Y_l, f_l, X_u, f_u)

PPI++ Logistic Regression

Description

Helper function for PPI++ logistic regression

Usage

ppi_plusplus_logistic(
  X_l,
  Y_l,
  f_l,
  X_u,
  f_u,
  lhat = NULL,
  coord = NULL,
  opts = NULL,
  w_l = NULL,
  w_u = NULL
)

Arguments

X_l

(matrix): n x p matrix of covariates in the labeled data.

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

X_u

(matrix): N x p matrix of covariates in the unlabeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

lhat

(float, optional): Power-tuning parameter (see https://arxiv.org/abs/2311.01453). The default value, NULL, will estimate the optimal value from the data. Setting lhat = 1 recovers PPI with no power tuning, and setting lhat = 0 recovers the classical point estimate.

coord

(int, optional): Coordinate for which to optimize lhat = 1. If NULL, it optimizes the total variance over all coordinates. Must be in (1, ..., d) where d is the dimension of the estimand.

opts

(list, optional): Options to pass to the optimizer. See ?optim for details.

w_l

(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones.

w_u

(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones.

Details

PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453

Value

(list): A list containing the following:

est

(vector): vector of PPI++ logistic regression coefficient estimates.

se

(vector): vector of standard errors of the coefficients.

lambda

(float): estimated power-tuning parameter.

rectifier_est

(vector): vector of the rectifier logistic regression coefficient estimates.

var_u

(matrix): covariance matrix for the gradients in the unlabeled data.

var_l

(matrix): covariance matrix for the gradients in the labeled data.

grads

(matrix): matrix of gradients for the labeled data.

grads_hat_unlabeled

(matrix): matrix of predicted gradients for the unlabeled data.

grads_hat

(matrix): matrix of predicted gradients for the labeled data.

inv_hessian

(matrix): inverse Hessian matrix.

Examples

dat <- simdat(model = "logistic")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

ppi_plusplus_logistic(X_l, Y_l, f_l, X_u, f_u)

PPI++ Logistic Regression (Point Estimate)

Description

Helper function for PPI++ logistic regression (point estimate)

Usage

ppi_plusplus_logistic_est(
  X_l,
  Y_l,
  f_l,
  X_u,
  f_u,
  lhat = NULL,
  coord = NULL,
  opts = NULL,
  w_l = NULL,
  w_u = NULL
)

Arguments

X_l

(matrix): n x p matrix of covariates in the labeled data.

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

X_u

(matrix): N x p matrix of covariates in the unlabeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

lhat

(float, optional): Power-tuning parameter (see https://arxiv.org/abs/2311.01453). The default value, NULL, will estimate the optimal value from the data. Setting lhat = 1 recovers PPI with no power tuning, and setting lhat = 0 recovers the classical point estimate.

coord

(int, optional): Coordinate for which to optimize lhat = 1. If NULL, it optimizes the total variance over all coordinates. Must be in (1, ..., d) where d is the dimension of the estimand.

opts

(list, optional): Options to pass to the optimizer. See ?optim for details.

w_l

(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones.

w_u

(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones.

Details

PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453'

Value

(vector): vector of prediction-powered point estimates of the logistic regression coefficients.

Examples

dat <- simdat(model = "logistic")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

ppi_plusplus_logistic_est(X_l, Y_l, f_l, X_u, f_u)

PPI++ Mean Estimation

Description

Helper function for PPI++ mean estimation

Usage

ppi_plusplus_mean(
  Y_l,
  f_l,
  f_u,
  alpha = 0.05,
  alternative = "two-sided",
  lhat = NULL,
  coord = NULL,
  w_l = NULL,
  w_u = NULL
)

Arguments

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

alpha

(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05.

alternative

(string): Alternative hypothesis. Must be one of "two-sided", "less", or "greater".

lhat

(float, optional): Power-tuning parameter (see https://arxiv.org/abs/2311.01453). The default value, NULL, will estimate the optimal value from the data. Setting lhat = 1 recovers PPI with no power tuning, and setting lhat = 0 recovers the classical point estimate.

coord

(int, optional): Coordinate for which to optimize lhat = 1. If NULL, it optimizes the total variance over all coordinates. Must be in (1, ..., d) where d is the dimension of the estimand.

w_l

(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones.

w_u

(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones.

Details

PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453'

Value

tuple: Lower and upper bounds of the prediction-powered confidence interval for the mean.

Examples

dat <- simdat(model = "mean")

form <- Y - f ~ 1

Y_l <- dat[dat$set == "labeled",   all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled",   all.vars(form)[2]] |> matrix(ncol = 1)

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

ppi_plusplus_mean(Y_l, f_l, f_u)

PPI++ Mean Estimation (Point Estimate)

Description

Helper function for PPI++ mean estimation (point estimate)

Usage

ppi_plusplus_mean_est(
  Y_l,
  f_l,
  f_u,
  lhat = NULL,
  coord = NULL,
  w_l = NULL,
  w_u = NULL
)

Arguments

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

lhat

(float, optional): Power-tuning parameter (see https://arxiv.org/abs/2311.01453). The default value, NULL, will estimate the optimal value from the data. Setting lhat = 1 recovers PPI with no power tuning, and setting lhat = 0 recovers the classical point estimate.

coord

(int, optional): Coordinate for which to optimize lhat = 1. If NULL, it optimizes the total variance over all coordinates. Must be in (1, ..., d) where d is the dimension of the estimand.

w_l

(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones.

w_u

(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones.

Details

PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453

Value

float or ndarray: Prediction-powered point estimate of the mean.

Examples

dat <- simdat(model = "mean")

form <- Y - f ~ 1

Y_l <- dat[dat$set == "labeled",   all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled",   all.vars(form)[2]] |> matrix(ncol = 1)

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

ppi_plusplus_mean_est(Y_l, f_l, f_u)

PPI++ OLS

Description

Helper function for PPI++ OLS estimation

Usage

ppi_plusplus_ols(
  X_l,
  Y_l,
  f_l,
  X_u,
  f_u,
  lhat = NULL,
  coord = NULL,
  w_l = NULL,
  w_u = NULL
)

Arguments

X_l

(matrix): n x p matrix of covariates in the labeled data.

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

X_u

(matrix): N x p matrix of covariates in the unlabeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

lhat

(float, optional): Power-tuning parameter (see https://arxiv.org/abs/2311.01453). The default value, NULL, will estimate the optimal value from the data. Setting lhat = 1 recovers PPI with no power tuning, and setting lhat = 0 recovers the classical point estimate.

coord

(int, optional): Coordinate for which to optimize lhat = 1. If NULL, it optimizes the total variance over all coordinates. Must be in (1, ..., d) where d is the dimension of the estimand.

w_l

(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones.

w_u

(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones.

Details

PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453'

Value

(list): A list containing the following:

est

(vector): vector of PPI++ OLS regression coefficient estimates.

se

(vector): vector of standard errors of the coefficients.

lambda

(float): estimated power-tuning parameter.

rectifier_est

(vector): vector of the rectifier OLS regression coefficient estimates.

var_u

(matrix): covariance matrix for the gradients in the unlabeled data.

var_l

(matrix): covariance matrix for the gradients in the labeled data.

grads

(matrix): matrix of gradients for the labeled data.

grads_hat_unlabeled

(matrix): matrix of predicted gradients for the unlabeled data.

grads_hat

(matrix): matrix of predicted gradients for the labeled data.

inv_hessian

(matrix): inverse Hessian matrix.

Examples

dat <- simdat(model = "ols")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

ppi_plusplus_ols(X_l, Y_l, f_l, X_u, f_u)

PPI++ OLS (Point Estimate)

Description

Helper function for PPI++ OLS estimation (point estimate)

Usage

ppi_plusplus_ols_est(
  X_l,
  Y_l,
  f_l,
  X_u,
  f_u,
  lhat = NULL,
  coord = NULL,
  w_l = NULL,
  w_u = NULL
)

Arguments

X_l

(matrix): n x p matrix of covariates in the labeled data.

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

X_u

(matrix): N x p matrix of covariates in the unlabeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

lhat

(float, optional): Power-tuning parameter (see https://arxiv.org/abs/2311.01453). The default value, NULL, will estimate the optimal value from the data. Setting lhat = 1 recovers PPI with no power tuning, and setting lhat = 0 recovers the classical point estimate.

coord

(int, optional): Coordinate for which to optimize lhat = 1. If NULL, it optimizes the total variance over all coordinates. Must be in (1, ..., d) where d is the dimension of the estimand.

w_l

(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones.

w_u

(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones.

Details

PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453

Value

(vector): vector of prediction-powered point estimates of the OLS coefficients.

Examples

dat <- simdat(model = "ols")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

ppi_plusplus_ols_est(X_l, Y_l, f_l, X_u, f_u)

PPI++ Quantile Estimation

Description

Helper function for PPI++ quantile estimation

Usage

ppi_plusplus_quantile(
  Y_l,
  f_l,
  f_u,
  q,
  alpha = 0.05,
  exact_grid = FALSE,
  w_l = NULL,
  w_u = NULL
)

Arguments

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

q

(float): Quantile to estimate. Must be in the range (0, 1).

alpha

(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05.

exact_grid

(bool, optional): Whether to compute the exact solution (TRUE) or an approximate solution based on a linearly spaced grid of 5000 values (FALSE).

w_l

(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones.

w_u

(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones.

Details

PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453

Value

tuple: Lower and upper bounds of the prediction-powered confidence interval for the quantile.

Examples

dat <- simdat(model = "quantile")

form <- Y - f ~ X1

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

ppi_plusplus_quantile(Y_l, f_l, f_u, q = 0.5)

PPI++ Quantile Estimation (Point Estimate)

Description

Helper function for PPI++ quantile estimation (point estimate)

Usage

ppi_plusplus_quantile_est(
  Y_l,
  f_l,
  f_u,
  q,
  exact_grid = FALSE,
  w_l = NULL,
  w_u = NULL
)

Arguments

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

q

(float): Quantile to estimate. Must be in the range (0, 1).

exact_grid

(bool, optional): Whether to compute the exact solution (TRUE) or an approximate solution based on a linearly spaced grid of 5000 values (FALSE).

w_l

(ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones.

w_u

(ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones.

Details

PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) https://arxiv.org/abs/2311.01453'

Value

(float): Prediction-powered point estimate of the quantile.

Examples

dat <- simdat(model = "quantile")

form <- Y - f ~ 1

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

ppi_plusplus_quantile_est(Y_l, f_l, f_u, q = 0.5)

PPI Quantile Estimation

Description

Helper function for PPI quantile estimation

Usage

ppi_quantile(Y_l, f_l, f_u, q, alpha = 0.05, exact_grid = FALSE)

Arguments

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

q

(float): Quantile to estimate. Must be in the range (0, 1).

alpha

(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05.

exact_grid

(bool, optional): Whether to compute the exact solution (TRUE) or an approximate solution based on a linearly spaced grid of 5000 values (FALSE).

Details

Prediction Powered Inference (Angelopoulos et al., 2023) https://www.science.org/doi/10.1126/science.adi6000

Value

tuple: Lower and upper bounds of the prediction-powered confidence interval for the quantile.

Examples

dat <- simdat(model = "quantile")

form <- Y - f ~ X1

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

ppi_quantile(Y_l, f_l, f_u, q = 0.5)

Print IPD Fit

Description

Prints a brief summary of the IPD method/model combination.

Usage

## S3 method for class 'ipd'
print(x, ...)

Arguments

x

An object of class ipd.

...

Additional arguments to be passed to the print function.

Value

The input x, invisibly.

Examples

#-- Generate Example Data

set.seed(2023)

dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)

head(dat)

formula <- Y - f ~ X1

#-- Fit IPD

fit <- ipd(formula, method = "postpi_analytic", model = "ols",

  data = dat, label = "set")

#-- Print Output

print(fit)

Print Summary of IPD Fit

Description

Prints a detailed summary of the IPD method/model combination.

Usage

## S3 method for class 'summary.ipd'
print(x, ...)

Arguments

x

An object of class summary.ipd.

...

Additional arguments to be passed to the print function.

Value

The input x, invisibly.

Examples

#-- Generate Example Data

set.seed(2023)

dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)

head(dat)

formula <- Y - f ~ X1

#-- Fit IPD

fit <- ipd(formula, method = "postpi_analytic", model = "ols",

  data = dat, label = "set")

#-- Summarize Output

summ_fit <- summary(fit)

print(summ_fit)

Estimating equation

Description

psi function for estimating equation

Usage

psi(X, Y, theta, quant = NA, method)

Arguments

X

Array or data.frame containing covariates

Y

Array or data.frame of outcomes

theta

parameter theta

quant

quantile for quantile estimation

method

indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson".

Value

esimating equation


PSPA Logistic Regression

Description

Helper function for PSPA logistic regression

Usage

pspa_logistic(X_l, Y_l, f_l, X_u, f_u, weights = NA, alpha = 0.05)

Arguments

X_l

(matrix): n x p matrix of covariates in the labeled data.

Y_l

(vector): n-vector of binary labeled outcomes.

f_l

(vector): n-vector of binary predictions in the labeled data.

X_u

(matrix): N x p matrix of covariates in the unlabeled data.

f_u

(vector): N-vector of binary predictions in the unlabeled data.

weights

(array): p-dimensional array of weights vector for variance reduction. PSPA will estimate the weights if not specified.

alpha

(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05

Details

Post-prediction adaptive inference (Miao et al. 2023) https://arxiv.org/abs/2311.14220

Value

A list of outputs: estimate of inference model parameters and corresponding standard error.

Examples

dat <- simdat(model = "logistic")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

pspa_logistic(X_l, Y_l, f_l, X_u, f_u)

PSPA Mean Estimation

Description

Helper function for PSPA mean estimation

Usage

pspa_mean(Y_l, f_l, f_u, weights = NA, alpha = 0.05)

Arguments

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

weights

(array): 1-dimensional array of weights vector for variance reduction. PSPA will estimate the weights if not specified.

alpha

(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05.

Details

Post-prediction adaptive inference (Miao et al., 2023) https://arxiv.org/abs/2311.14220

Value

A list of outputs: estimate of inference model parameters and corresponding standard error.

Examples

dat <- simdat(model = "mean")

form <- Y - f ~ 1

Y_l <- dat[dat$set == "labeled",   all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled",   all.vars(form)[2]] |> matrix(ncol = 1)

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

pspa_mean(Y_l, f_l, f_u)

PSPA OLS Estimation

Description

Helper function for PSPA OLS for linear regression

Usage

pspa_ols(X_l, Y_l, f_l, X_u, f_u, weights = NA, alpha = 0.05)

Arguments

X_l

(matrix): n x p matrix of covariates in the labeled data.

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

X_u

(matrix): N x p matrix of covariates in the unlabeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

weights

(array): p-dimensional array of weights vector for variance reduction. PSPA will estimate the weights if not specified.

alpha

(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05.

Details

Post-prediction adaptive inference (Miao et al. 2023) https://arxiv.org/abs/2311.14220

Value

A list of outputs: estimate of inference model parameters and corresponding standard error.

Examples

dat <- simdat(model = "ols")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

pspa_ols(X_l, Y_l, f_l, X_u, f_u)

PSPA Poisson Regression

Description

Helper function for PSPA Poisson regression

Usage

pspa_poisson(X_l, Y_l, f_l, X_u, f_u, weights = NA, alpha = 0.05)

Arguments

X_l

(matrix): n x p matrix of covariates in the labeled data.

Y_l

(vector): n-vector of count labeled outcomes.

f_l

(vector): n-vector of binary predictions in the labeled data.

X_u

(matrix): N x p matrix of covariates in the unlabeled data.

f_u

(vector): N-vector of binary predictions in the unlabeled data.

weights

(array): p-dimensional array of weights vector for variance reduction. PSPA will estimate the weights if not specified.

alpha

(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05

Details

Post-prediction adaptive inference (Miao et al. 2023) https://arxiv.org/abs/2311.14220

Value

A list of outputs: estimate of inference model parameters and corresponding standard error.

Examples

dat <- simdat(model = "poisson")

form <- Y - f ~ X1

X_l <- model.matrix(form, data = dat[dat$set == "labeled",])

Y_l <- dat[dat$set == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)

X_u <- model.matrix(form, data = dat[dat$set == "unlabeled",])

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

pspa_poisson(X_l, Y_l, f_l, X_u, f_u)

PSPA Quantile Estimation

Description

Helper function for PSPA quantile estimation

Usage

pspa_quantile(Y_l, f_l, f_u, q, weights = NA, alpha = 0.05)

Arguments

Y_l

(vector): n-vector of labeled outcomes.

f_l

(vector): n-vector of predictions in the labeled data.

f_u

(vector): N-vector of predictions in the unlabeled data.

q

(float): Quantile to estimate. Must be in the range (0, 1).

weights

(array): 1-dimensional array of weights vector for variance reduction. PSPA will estimate the weights if not specified.

alpha

(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05.

Details

Post-prediction adaptive inference (Miao et al. 2023) https://arxiv.org/abs/2311.14220

Value

A list of outputs: estimate of inference model parameters and corresponding standard error.

Examples

dat <- simdat(model = "quantile")

form <- Y - f ~ 1

Y_l <- dat[dat$set == "labeled",   all.vars(form)[1]] |> matrix(ncol = 1)

f_l <- dat[dat$set == "labeled",   all.vars(form)[2]] |> matrix(ncol = 1)

f_u <- dat[dat$set == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)

pspa_quantile(Y_l, f_l, f_u, q = 0.5)

PSPA M-Estimation for ML-predicted labels

Description

pspa_y function conducts post-prediction M-Estimation.

Usage

pspa_y(
  X_lab = NA,
  X_unlab = NA,
  Y_lab,
  Yhat_lab,
  Yhat_unlab,
  alpha = 0.05,
  weights = NA,
  quant = NA,
  intercept = FALSE,
  method
)

Arguments

X_lab

Array or data.frame containing observed covariates in labeled data.

X_unlab

Array or data.frame containing observed or predicted covariates in unlabeled data.

Y_lab

Array or data.frame of observed outcomes in labeled data.

Yhat_lab

Array or data.frame of predicted outcomes in labeled data.

Yhat_unlab

Array or data.frame of predicted outcomes in unlabeled data.

alpha

Specifies the confidence level as 1 - alpha for confidence intervals.

weights

weights vector PSPA linear regression (d-dimensional, where d equals the number of covariates).

quant

quantile for quantile estimation

intercept

Boolean indicating if the input covariates' data contains the intercept (TRUE if the input data contains)

method

indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson".

Value

A summary table presenting point estimates, standard error, confidence intervals (1 - alpha), P-values, and weights.

Examples

data <- sim_data_y()

X_lab <- data$X_lab

X_unlab <- data$X_unlab

Y_lab <- data$Y_lab

Yhat_lab <- data$Yhat_lab

Yhat_unlab <- data$Yhat_unlab

pspa_y(X_lab = X_lab, X_unlab = X_unlab,

 Y_lab = Y_lab, Yhat_lab = Yhat_lab, Yhat_unlab = Yhat_unlab,

 alpha = 0.05, method = "ols")

Rectified CDF

Description

Computes the rectified CDF of the data.

Usage

rectified_cdf(Y_l, f_l, f_u, grid, w_l = NULL, w_u = NULL)

Arguments

Y_l

(vector): Gold-standard labels.

f_l

(vector): Predictions corresponding to the gold-standard labels.

f_u

(vector): Predictions corresponding to the unlabeled data.

grid

(vector): Grid of values to compute the CDF at.

w_l

(vector, optional): Sample weights for the labeled data set.

w_u

(vector, optional): Sample weights for the unlabeled data set.

Value

(vector): Rectified CDF of the data at the specified grid points.

Examples

Y_l <- c(1, 2, 3, 4, 5)

f_l <- c(1.1, 2.2, 3.3, 4.4, 5.5)

f_u <- c(1.2, 2.3, 3.4)

grid <- seq(0, 6, by = 0.5)

rectified_cdf(Y_l, f_l, f_u, grid)

Rectified P-Value

Description

Computes a rectified p-value.

Usage

rectified_p_value(
  rectifier,
  rectifier_std,
  imputed_mean,
  imputed_std,
  null = 0,
  alternative = "two-sided"
)

Arguments

rectifier

(float or vector): Rectifier value.

rectifier_std

(float or vector): Rectifier standard deviation.

imputed_mean

(float or vector): Imputed mean.

imputed_std

(float or vector): Imputed standard deviation.

null

(float, optional): Value of the null hypothesis to be tested. Defaults to 0.

alternative

(str, optional): Alternative hypothesis, either 'two-sided', 'larger' or 'smaller'.

Value

(float or vector): The rectified p-value.

Examples

rectifier <- 0.7

rectifier_std <- 0.5

imputed_mean <- 1.5

imputed_std <- 0.3

rectified_p_value(rectifier, rectifier_std, imputed_mean, imputed_std)

Variance-covariance matrix of the estimation equation

Description

Sigma_cal function for variance-covariance matrix of the estimation equation

Usage

Sigma_cal(
  X_lab,
  X_unlab,
  Y_lab,
  Yhat_lab,
  Yhat_unlab,
  w,
  theta,
  quant = NA,
  method
)

Arguments

X_lab

Array or data.frame containing observed covariates in labeled data.

X_unlab

Array or data.frame containing observed or predicted covariates in unlabeled data.

Y_lab

Array or data.frame of observed outcomes in labeled data.

Yhat_lab

Array or data.frame of predicted outcomes in labeled data.

Yhat_unlab

Array or data.frame of predicted outcomes in unlabeled data.

w

weights vector PSPA linear regression (d-dimensional, where d equals the number of covariates).

theta

parameter theta

quant

quantile for quantile estimation

method

indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson".

Value

variance-covariance matrix of the estimation equation


Simulate the data for testing the functions

Description

sim_data_y for simulation with ML-predicted Y

Usage

sim_data_y(r = 0.9, binary = FALSE)

Arguments

r

imputation correlation

binary

simulate binary outcome or not

Value

simulated data


Data generation function for various underlying models

Description

Data generation function for various underlying models

Usage

simdat(
  n = c(300, 300, 300),
  effect = 1,
  sigma_Y = 1,
  model = "ols",
  shift = 0,
  scale = 1
)

Arguments

n

Integer vector of size 3 indicating the sample sizes in the training, labeled, and unlabeled data sets, respectively

effect

Regression coefficient for the first variable of interest for inference. Defaults is 1.

sigma_Y

Residual variance for the generated outcome. Defaults is 1.

model

The type of model to be generated. Must be one of "mean", "quantile", "ols", or "logistic". Default is "ols".

shift

Scalar shift of the predictions for continuous outcomes (i.e., "mean", "quantile", and "ols"). Defaults to 0.

scale

Scaling factor for the predictions for continuous outcomes (i.e., "mean", "quantile", and "ols"). Defaults to 1.

Value

A data.frame containing n rows and columns corresponding to the labeled outcome (Y), the predicted outcome (f), a character variable (set) indicating which data set the observation belongs to (training, labeled, or unlabeled), and four independent, normally distributed predictors (X1, X2, X3, and X4), where applicable.

Examples

#-- Mean

dat_mean <- simdat(c(100, 100, 100), effect = 1, sigma_Y = 1,

  model = "mean")

head(dat_mean)

#-- Linear Regression

dat_ols <- simdat(c(100, 100, 100), effect = 1, sigma_Y = 1,

  model = "ols")

head(dat_ols)

Summarize IPD Fit

Description

Produces a summary of the IPD method/model combination.

Usage

## S3 method for class 'ipd'
summary(object, ...)

Arguments

object

An object of class ipd.

...

Additional arguments to be passed to the summary function.

Value

A list containing:

coefficients

Model coefficients and related statistics.

performance

Performance metrics of the model fit.

...

Additional summary information.

Examples

#-- Generate Example Data

set.seed(2023)

dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)

head(dat)

formula <- Y - f ~ X1

#-- Fit IPD

fit <- ipd(formula, method = "postpi_analytic", model = "ols",

  data = dat, label = "set")

#-- Summarize Output

summ_fit <- summary(fit)

summ_fit

Tidy an IPD Fit

Description

Tidies the IPD method/model fit into a data frame.

Usage

## S3 method for class 'ipd'
tidy(x, ...)

Arguments

x

An object of class ipd.

...

Additional arguments to be passed to the tidy function.

Value

A tidy data frame of the model's coefficients.

Examples

#-- Generate Example Data

set.seed(2023)

dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)

head(dat)

formula <- Y - f ~ X1

#-- Fit IPD

fit <- ipd(formula, method = "postpi_analytic", model = "ols",

  data = dat, label = "set")

#-- Tidy Output

tidy(fit)

Weighted Least Squares

Description

Computes the weighted least squares estimate of the coefficients.

Usage

wls(X, Y, w = NULL, return_se = FALSE)

Arguments

X

(matrix): n x p matrix of covariates.

Y

(vector): p-vector of outcome values.

w

(vector, optional): n-vector of sample weights.

return_se

(bool, optional): Whether to return the standard errors of the coefficients.

Value

(list): A list containing the following:

theta

(vector): p-vector of weighted least squares estimates of the coefficients.

se

(vector): If return_se == TRUE, return the p-vector of standard errors of the coefficients.

Examples

n <- 1000

X <- rnorm(n, 1, 1)

w <- rep(1, n)

Y <- X + rnorm(n, 0, 1)

wls(X, Y, w = w, return_se = TRUE)

Normal Confidence Intervals

Description

Calculates normal confidence intervals for a given alternative at a given significance level.

Usage

zconfint_generic(mean, std_mean, alpha, alternative)

Arguments

mean

(float): Estimated normal mean.

std_mean

(float): Estimated standard error of the mean.

alpha

(float): Significance level in [0,1]

alternative

(string): Alternative hypothesis, either 'two-sided', 'larger' or 'smaller'.

Value

(vector): Lower and upper (1 - alpha) * 100% confidence limits.

Examples

n <- 1000

Y <- rnorm(n, 1, 1)

se_Y <-  sd(Y) / sqrt(n)

zconfint_generic(Y, se_Y, alpha = 0.05, alternative = "two-sided")

Compute Z-Statistic and P-Value

Description

Computes the z-statistic and the corresponding p-value for a given test.

Usage

zstat_generic(value1, value2, std_diff, alternative, diff = 0)

Arguments

value1

(numeric): The first value or sample mean.

value2

(numeric): The second value or sample mean.

std_diff

(numeric): The standard error of the difference between the two values.

alternative

(character): The alternative hypothesis. Can be one of "two-sided" (or "2-sided", "2s"), "larger" (or "l"), or "smaller" (or "s").

diff

(numeric, optional): The hypothesized difference between the two values. Default is 0.

Value

(list): A list containing the following:

zstat

(numeric): The computed z-statistic.

pvalue

(numeric): The corresponding p-value for the test.

Examples

value1 <- 1.5

value2 <- 1.0

std_diff <- 0.2

alternative <- "two-sided"

result <- zstat_generic(value1, value2, std_diff, alternative)