catalytic_glm_gaussian

Introduction

This vignette provides an overview of how to use the functions in the catalytic package that focuses on GLM Linear Regression. The other catalytic vignettes go into other model-estimating functions.

The goal of the catalytic package is to build framework for catalytic prior distributions. Stabilizing high-dimensional working models by shrinking them towards simplified models. This is achieved by supplementing observed data with weighted synthetic data generated from a predictive distribution under the simpler model. For more information, see (Huang et al. 2020).

The two steps of using catalytic package for GLM Linear Regression are

  1. Initialization: The cat_glm_initialization function constructs a cat_init object based on the formula provided by the user to generate synthetic data. The resulting cat_init object is tailored to facilitate further analysis, and is integral for subsequent modeling steps in the catalytic package.

  2. Choose Method(s): Users have the flexibility to choose from four main functions within the catalytic package: cat_glm, cat_glm_tune, cat_glm_bayes, and cat_glm_bayes_joint. Each function serves a specific purpose in modeling with catalytic priors and offers distinct capabilities tailored to different modeling scenarios for GLM Linear Regression. This approach enables users to seamlessly incorporate synthetic data with varying weights from different method into GLM Linear Regression analyses, providing flexibility and control over the modeling process.

Data Preparation

Creating a high-dimensional dataset with a low data size. This step involves increasing the number of features (dimensions) while keeping the number of observations (data size) relatively small. This is useful for testing the performance of catalytic models in high-dimensional settings.

A randomly generated dataset is split into training (train_data) and test (test_data) datasets.

library(catalytic)

set.seed(1)

n <- 20 # Number of observations
p <- 5 # Number of predictors

obs_x <- matrix(rnorm(n * (p - 1)), ncol = (p - 1)) # Observation covariates
true_coefs <- rnorm(p) # True coefficient
noise <- rnorm(n) # Noise for more response variability
obs_y <- true_coefs[1] + obs_x %*% true_coefs[-1] + noise # Observation response

obs_data <- as.data.frame(cbind(obs_x, obs_y))
names(obs_data) <- c(paste0("X", 1:(p - 1)), "Y")

# Seperate observation data into train and test data
train_idx <- sample(n, 10)
train_data <- obs_data[train_idx, ]
test_data <- obs_data[-train_idx, ]

print(dim(train_data))
#> [1] 10  5

In this section, we explore the foundational steps of fitting a Linear regression model (GLM) using the stats::glm function with the gaussian family.

# Fit a Linear regression model (GLM)
glm_model <- stats::glm(
  formula = Y ~ .,
  family = gaussian,
  data = train_data
)

predicted_y <- predict(
  glm_model,
  newdata = test_data
)

cat(
  "MLE GLM gaussian Model - Mean Square Error (Data):",
  mean((predicted_y - test_data$Y)^2)
)
#> MLE GLM gaussian Model - Mean Square Error (Data): 5.713175

cat(
  "\nMLE GLM gaussian Model - Sum Square Error (Coefficients):",
  sum((coef(glm_model) - true_coefs)^2)
)
#> 
#> MLE GLM gaussian Model - Sum Square Error (Coefficients): 4.682472

Let us check the scatter plot of the predicted_y from glm_model versus the test_data$Y, this can be a great way to visually assess the accuracy and performance of the model.

plot(test_data$Y,
  predicted_y,
  main = "Scatter Plot of true Y vs Predicted Y",
  xlab = "true Y",
  ylab = "Predicted Y",
  pch = 19,
  col = "blue"
)

# Add a 45-degree line for reference
abline(a = 0, b = 1, col = "red", lwd = 2)

Usage of catalytic

Step 1: Initialization

To initialize data for GLM Linear Regression using the catalytic package, the cat_glm_initialization function is employed. This function facilitates the setup by preparing synthetic data tailored for modeling purposes.

Here’s a breakdown of the parameters used:

  • formula: A formula specifying the GLMs. Should include response and predictor variables.
  • family: The type of GLM family. Defaults to Gaussian.
  • data: A data frame containing the data for modeling.
  • syn_size: An integer specifying the size of the synthetic dataset to be generated. Default is four times the number of predictor columns.
  • custom_variance: A custom variance value to be applied if using a Gaussian model. Defaults to NULL.
  • gaussian_known_variance: A logical value indicating whether the data variance is known. Defaults to FALSE. Only applicable to Gaussian family.
  • x_degree: A numeric vector indicating the degree for polynomial expansion of predictors. Default is 1 for each predictor.
  • resample_only: A logical indicating whether to perform resampling only. Default is FALSE.
  • na_replace: A function to handle NA values in the data. Default is stats::na.omit.
cat_init <- cat_glm_initialization(
  formula = Y ~ 1,
  family = gaussian,
  data = train_data,
  syn_size = 50,
  custom_variance = NULL,
  gaussian_known_variance = FALSE,
  x_degree = NULL,
  resample_only = FALSE,
  na_replace = stats::na.omit
)

cat_init
#> cat_glm_initialization
#>  formula:               Y ~ 1
#>  model:                 Unknown Variance
#>  custom variance:       NULL
#>  family:                gaussian [identity]
#>  covariates dimention:  10 (Observation) + 50 (Synthetic) = 60 rows with 4 column(s)
#> ------
#> Observation Data Information:
#>  covariates dimention:  10 rows with  4 column(s)
#>  head(data) :              
#>  [only show the head of first 10 columns] 
#> 
#>            X1          X2         X3         X4          Y
#> 3  -0.8356286  0.07456498  0.6969634  0.6897394 -1.3242517
#> 20  0.5939013  0.76317575 -0.1350546 -0.5895209 -0.5488277
#> 14 -2.2146999 -0.05380504 -1.1293631 -0.9340976 -0.3914319
#> 13 -0.6212406  0.38767161  0.3411197  0.6107264 -0.7582249
#> 11  1.5117812  1.35867955  0.3981059  0.4755095  1.0619861
#> 8   0.7383247 -1.47075238  0.7685329  1.4655549 -1.5411966
#> 
#> ------
#> Synthetic Data Information:
#>  covariates dimention:  50 rows with  4  column(s)
#>  head(data):              
#>  [only show the head of first 10 columns] 
#> 
#>           X1          X2         X3         X4          Y
#> 1  0.1836433  0.07456498  0.3411197  0.6897394 -0.2004061
#> 2 -0.3053884  0.41794156 -1.1293631  0.6107264 -0.2004061
#> 3 -0.6212406  0.38767161  0.7685329  0.6107264 -0.2004061
#> 4  1.5117812 -0.10278773 -0.6120264  0.6897394 -0.2004061
#> 5  0.3898432 -0.10278773 -0.1123462 -0.0392400 -0.2004061
#> 6  1.5117812 -0.10278773  0.3981059  0.4755095 -0.2004061
#> 
#>  data generation process:
#>  [only show the first 10 columns] 
#> 
#>     Covariate        Type     Process
#> 1 X1          Continuous  Coordinate 
#> 2 X2          Continuous  Coordinate 
#> 3 X3          Continuous  Coordinate 
#> 4 X4          Continuous  Coordinate 
#> 
#> ------
#> * For help interpreting the printed output see ?print.cat_initialization

Here shows how users can simplify the input for cat_glm_initialization. User do not have to specify syn_size and other parameters, as they have default values, which mentioned above. cat_init objects contain a list of attributes, which is typically generated from above function cat_glm_initialization. These attributes provide comprehensive information for below modeling tasks or for user check.

Here’s a breakdown of all attributes except the input parameters:

  • function_name: The name of this function, which is cat_glm_initialization.

  • obs_size: The number of observations (rows) in the original dataset (obs_data).

  • obs_data: The original dataset (data).

  • obs_x: The covariates from original dataset (obs_data).

  • obs_y: The response from original dataset (obs_data).

  • syn_size: The size of synthetic data generated.

  • syn_data: The synthetic data created for modeling purposes, based on the original dataset (obs_data) characteristics.

  • syn_x: The covariates from synthetic dataset (syn_data).

  • syn_y: The response from synthetic dataset (syn_data).

  • syn_x_resample_inform: The information detailing the process of resampling synthetic data.

  • size: The total size of the combined dataset (obs_size and syn_size).

  • data: The combined dataset (obs_data and syn_data).

  • x: The combined covariates (obs_x and syn_x)

  • y: The combined response (obs_y and syn_y)

For more details, please check ?cat_glm_initialization.

names(cat_init)
#>  [1] "function_name"           "formula"                
#>  [3] "family"                  "syn_size"               
#>  [5] "custom_variance"         "gaussian_known_variance"
#>  [7] "x_degree"                "resample_only"          
#>  [9] "na_replace"              "y_col_name"             
#> [11] "simple_model"            "obs_size"               
#> [13] "obs_data"                "obs_x"                  
#> [15] "obs_y"                   "syn_size"               
#> [17] "syn_data"                "syn_x"                  
#> [19] "syn_y"                   "syn_x_resample_inform"  
#> [21] "size"                    "data"                   
#> [23] "x"                       "y"

And of course, user can extract items mentioned above from cat_glm_initialization object.

# The number of observations (rows) in the original dataset (`obs_data`)
cat_init$obs_size
#> [1] 10

# The information detailing the process of resampling synthetic data
cat_init$syn_x_resample_inform
#>   Covariate       Type    Process
#> 1        X1 Continuous Coordinate
#> 2        X2 Continuous Coordinate
#> 3        X3 Continuous Coordinate
#> 4        X4 Continuous Coordinate

Step 2.1: Choose Method(s) - Estimation with Fixed tau

The cat_glm function fits a Generalized Linear Model (GLM) with a catalytic prior on the regression coefficients. It utilizes information from the cat_init object generated during the initialization step, which includes both observed and synthetic data, plus other relevant information.

The GLM model is then fitted using the specified formula, family, and a single tau(synthetic data down-weight factor). The resulting cat_glm object encapsulates the fitted model, including estimated coefficients and family information, facilitating further analysis.

Here’s a breakdown of the parameters used:

  • formula: This parameter specifies the model formula used in the GLM (Generalized Linear Model). It defines the relationship between the response variable and the predictors. Alternatively, besides using in format RESPONSE ~ COVARIATES, user can also use ~ COVARIATES without specifying the response name, since the response name is defined in the initialization step.

  • cat_init: This parameter is essential and represents the initialization object (cat_init) created by using cat_glm_initialization. It contains both observed and synthetic data, plus other relevant information, necessary for model fitting.

  • tau: This parameter determines the down-weight assigned to synthetic data relative to observed data. It influences the influence of synthetic data in the model fitting process. If not specified (NULL), it defaults to a one forth of the number of predictors.

cat_glm_model <- cat_glm(
  formula = Y ~ ., # Same as `~ .`
  cat_init = cat_init, # Output object from `cat_glm_initialization`
  tau = 10 # Defaults to the number of predictors / 4
)

cat_glm_model
#> cat_glm
#>  formula:                Y ~ .
#>  covariates dimention:   10 (Observation) + 50 (Synthetic) = 60 rows with 4 column(s)
#>  tau:                    10
#>  family:                 gaussian [identity]
#> ------
#> coefficients' information:
#> (Intercept)          X1          X2          X3          X4 
#>      -0.391       0.061       0.600       0.045       0.207 
#> 
#> ------
#> * For help interpreting the printed output see ?print.cat

Here shows how users can simplify the input for cat_glm. User do not have to specify tau, as tau has default value , which mentioned above.

Let’s check the prediction error.

cat_glm_predicted_y <- predict(
  cat_glm_model,
  newdata = test_data
)

cat(
  "Catalytic `cat_glm` - Mean Square Error (Data):",
  mean((cat_glm_predicted_y - test_data$Y)^2)
)
#> Catalytic `cat_glm` - Mean Square Error (Data): 4.628022

cat(
  "\nCatalytic `cat_glm` - Sum Square Error (Coefficients):",
  sum((coef(cat_glm_model) - true_coefs)^2)
)
#> 
#> Catalytic `cat_glm` - Sum Square Error (Coefficients): 3.014126

Let us check the scatter plot of the cat_glm_predicted_y from cat_glm_model versus the test_data$Y, this can be a great way to visually assess the accuracy and performance of the model.

plot(test_data$Y,
  cat_glm_predicted_y,
  main = "Scatter Plot of true Y vs Predicted Y (cat_glm)",
  xlab = "true Y",
  ylab = "Predicted Y (cat_glm)",
  pch = 19,
  col = "blue"
)

# Add a 45-degree line for reference
abline(a = 0, b = 1, col = "red", lwd = 2)

Both cat_glm_model objects are outputs from the cat_glm function, providing a list of attributes for further analysis or user inspection.

Here’s a breakdown of all attributes except the input parameters:

  • function_name: The name of the function (cat_glm).

  • model: The fitted GLM model object obtained from stats::glm, with tau.

  • coefficients: The estimated coefficients from the fitted GLM model model.

For more details, please check ?cat_glm.

names(cat_glm_model)
#> [1] "function_name" "formula"       "cat_init"      "tau"          
#> [5] "model"         "coefficients"

User can extract items mentioned above from cat_glm object.

# The formula used for modeling
cat_glm_model$formula
#> Y ~ .
#> <environment: 0x55ee4307b760>

# The fitted GLM model object obtained from `stats::glm` with `tau`
cat_glm_model$coefficients
#> (Intercept)          X1          X2          X3          X4 
#> -0.39135050  0.06098358  0.59964620  0.04472616  0.20656057

Step 2.2: Choose Method(s) - Estimation with Selective tau

The cat_glm_tune function fits a GLM with a catalytic prior on the regression coefficients and provides options for optimizing model performance over a range of tau values(tau_seq).

These methods empower users to fit and optimize GLM models with catalytic priors, leveraging both observed and synthetic data to enhance model performance and robustness in various statistical analyses.

Cross-validation (risk_estimate_method = “cross_validation”)

This method computes the partial likelihood across a specified range of tau values (tau_seq). It iterates through each tau value, evaluating its performance based on cross-validation folds (cv_fold_num) to select the optimal tau that minimizes the discrepancy error.

Here’s a breakdown of the parameters used:

  • formula and cat_init are same as above.

  • risk_estimate_method: Method for risk estimation, chosen from “parametric_bootstrap”, “cross_validation”, “mallowian_estimate”, “steinian_estimate”. In this example, “cross_validation” is used.

  • discrepancy_method: Method for discrepancy calculation, chosen from “mean_square_error”, “mean_classification_error”, “logistic_deviance”. In this example, “mean_square_error” is used because the family is gaussian.

  • tau_seq: Vector of positive numeric values for down-weighting synthetic data. Defaults to a sequence around one fourth of the number of predictors.

  • cross_validation_fold_num: Number of folds for cross-validation. Defaults to 5.

cat_glm_tune_cv <- cat_glm_tune(
  formula = Y ~ ., # Same as `~ .`
  cat_init = cat_init,
  risk_estimate_method = "cross_validation", # Default auto-select based on the data size
  discrepancy_method = "mean_square_error", # Default auto-select based on family
  tau_seq = seq(0, 5, 0.5), # Default is a numeric sequence around the number of predictors / 4
  cross_validation_fold_num = 2 # Default: 5
)

cat_glm_tune_cv
#> cat_glm_tune
#>  formula:                 Y ~ .
#>  covariates dimention:    10 (Observation) + 50 (Synthetic) = 60 rows with 4 column(s)
#>  tau sequnce:             0, 0.5, 1 ... 4, 4.5, 5
#>  family:                  gaussian
#>  risk estimate method:    cross_validation
#>  discrepancy method:      mean_square_error
#> 
#>  optimal tau:             0.5
#>  minimun risk estimate:   1.266
#> ------
#> coefficients' information:
#> 
#> (Intercept)          X1          X2          X3          X4 
#>      -0.682       0.171       1.138      -1.475       1.262 
#> 
#> ------
#> * For help interpreting the printed output see ?print.cat_tune

User can plot the tau_seq (x) against discrepancy error (y) using the plot() function. This plot will show the lowest discrepancy error at the optimal tau value.

plot(cat_glm_tune_cv, legend_pos = "topright", text_pos = 3)

Bootstrap (risk_estimate_method = “parametric_bootstrap”)

This method estimates tau using bootstrap resampling, refining the model through iterative sampling to enhance robustness and accuracy.

Here’s a breakdown of other parameters used:

  • tau_0: Initial tau value used for discrepancy calculation in risk estimation. Defaults to one fourth of the number of predictors for binomial and 1 for gaussian.

  • parametric_bootstrap_iteration_times: Number of bootstrap iterations for “parametric_bootstrap” risk estimation. Defaults to 100.

For the breakdown of other input parameters, please check section Cross Validation

cat_glm_tune_boots <- cat_glm_tune(
  formula = ~., # Same as `Y ~ .`
  cat_init = cat_init,
  risk_estimate_method = "parametric_bootstrap", # Default auto-select based on the data size
  discrepancy_method = "mean_square_error", # Default auto-select based on family
  tau_0 = 2, # Default: 1
  parametric_bootstrap_iteration_times = 5, # Default: 100
)

cat_glm_tune_boots
#> cat_glm_tune
#>  formula:                 Y ~ .
#>  covariates dimention:    10 (Observation) + 50 (Synthetic) = 60 rows with 4 column(s)
#>  tau sequnce:             0.01, 0.51, 1.01, 1.51, 2.01, 2.51
#>  family:                  gaussian
#>  risk estimate method:    parametric_bootstrap
#>  discrepancy method:      mean_square_error
#> 
#>  optimal tau:             2.51
#>  minimun risk estimate:   3.888
#> ------
#> coefficients' information:
#> 
#> (Intercept)          X1          X2          X3          X4 
#>      -0.501       0.072       0.908      -0.342       0.519 
#> 
#> ------
#> * For help interpreting the printed output see ?print.cat_tune

Mallowian Estimate (risk_estimate_method = ” mallowian_estimate”)

This method computes the risk estimate using a mallowian estimate approach, optimizing the model based on observed and synthetic data.

For the breakdown of the input parameters, please check section Cross Validation and Bootstrap)

cat_glm_tune_mallowian <- cat_glm_tune(
  formula = ~., # Same as `Y ~ .`
  cat_init = cat_init,
  risk_estimate_method = "mallowian_estimate", # Default auto-select based on the data size
  discrepancy_method = "mean_square_error", # Default auto-select based on family
)

cat_glm_tune_mallowian
#> cat_glm_tune
#>  formula:                 Y ~ .
#>  covariates dimention:    10 (Observation) + 50 (Synthetic) = 60 rows with 4 column(s)
#>  tau sequnce:             0.01, 0.51, 1.01, 1.51, 2.01, 2.51
#>  family:                  gaussian
#>  risk estimate method:    mallowian_estimate
#>  discrepancy method:      mean_square_error
#> 
#>  optimal tau:             0.01
#>  minimun risk estimate:   0.847
#> ------
#> coefficients' information:
#> 
#> (Intercept)          X1          X2          X3          X4 
#>      -0.882       0.314       1.299      -2.845       2.136 
#> 
#> ------
#> * For help interpreting the printed output see ?print.cat_tune

Recommendations for Choosing risk_estimate_method and discrepancy_method

Choosing the appropriate risk_estimate_method and discrepancy_method depends on the data size, model complexity, and the specific requirements of user’s analysis.

  1. Choosing risk_estimate_method
  • Small to Medium Data Size (observation data size <= 200)
    • Default: “parametric_bootstrap”
    • For smaller datasets, parametric bootstrap is generally chosen. It efficiently estimates the model’s performance using resampling techniques.
  • Large Data Size (observation data size > 200):
    • Default: “cross_validation”
    • For larger datasets, cross-validation is preferred. It provides a robust estimate of model performance by splitting the data into multiple folds and averaging the results.
  1. Setting discrepancy_method
  • GLM Family is gaussian (family == “gaussian”)
    • For gaussian GLM families, “mean_square_error” is the only option. It calculates the discrepancy based on the linear likelihood function, suitable for linear outcomes.
  • GLM Family is not gaussian (family != “gaussian”):
    • Default: “logistic_deviance”
    • For other GLM families (e.g. binomial), “logistic_deviance” is typically used. Besides “logistic_deviance”, user can also choose “classification_error” for binary outcomes. Please check catalytic_glm_binomial for more details.

Automatic Parameter Selection

Of course, user don’t need to worry about specifying these parameters explicitly, and they just need to simply provide the cat_init object and the formula. then cat_glm_tune will automatically select risk_estimate_method and discrepancy_method based on the dataset size and GLM family type.

In this example, it is risk_estimate_method = "parametric_bootstrap" and discrepancy_method = "square_error".

For the breakdown of the input parameters, please check section Cross Validation and Bootstrap

cat_glm_tune_auto <- cat_glm_tune(
  formula = ~., # Same as `Y ~ .`
  cat_init = cat_init
)

cat_glm_tune_auto
#> cat_glm_tune
#>  formula:                 Y ~ .
#>  covariates dimention:    10 (Observation) + 50 (Synthetic) = 60 rows with 4 column(s)
#>  tau sequnce:             0.01, 0.51, 1.01, 1.51, 2.01, 2.51
#>  family:                  gaussian
#>  risk estimate method:    parametric_bootstrap
#>  discrepancy method:      mean_square_error
#> 
#>  optimal tau:             2.51
#>  minimun risk estimate:   6.43
#> ------
#> coefficients' information:
#> 
#> (Intercept)          X1          X2          X3          X4 
#>      -0.501       0.072       0.908      -0.342       0.519 
#> 
#> ------
#> * For help interpreting the printed output see ?print.cat_tune

Let’s check the prediction error.

cat_glm_tune_predicted_y <- predict(
  cat_glm_tune_auto,
  newdata = test_data
)

cat(
  "Catalytic `cat_glm_tune` - Mean Square Error (Data):",
  mean((cat_glm_tune_predicted_y - test_data$Y)^2)
)

cat(
  "\nCatalytic `cat_glm_tune` - Sum Square Error (Coefficients):",
  sum((coef(cat_glm_tune_auto) - true_coefs)^2)
)

Let us check the scatter plot of the cat_glm_tune_predicted_y from cat_glm_tune_auto versus the test_data$Y, this can be a great way to visually assess the accuracy and performance of the model.

plot(test_data$Y, cat_glm_tune_predicted_y,
  main = "Scatter Plot of true Y vs Predicted Y (cat_glm_tune)",
  xlab = "true Y",
  ylab = "Predicted Y (cat_glm_tune)",
  pch = 19,
  col = "blue"
)

# Add a 45-degree line for reference
abline(a = 0, b = 1, col = "red", lwd = 2)

All above objects in this section including cat_glm_tune_auto objects are outputs from the cat_glm_tune function, providing a list of attributes for further analysis or user inspection.

Here’s a breakdown of all attributes except the input parameters:

  • function_name: The name of the function (cat_glm_tune).

  • tau: Selected optimal tau value from tau_seq that minimizes discrepancy error.

  • model: The fitted GLM model object obtained from stats::glm, with the selected tau (tau).

  • coefficients: The estimated coefficients from the fitted GLM model (model).

  • risk_estimate_list: Collected risk estimates across different tau values.

For more details, please check ?cat_glm_tune.

names(cat_glm_tune_auto)

User can extract items mentioned above from cat_glm_tune object.

# The method used for risk estimation
cat_glm_tune_auto$risk_estimate_method

# Selected optimal tau value from `tau_seq` that minimizes discrepancy error
cat_glm_tune_auto$tau

Step 2.3: Choose Method(s) - Bayesian Posterior Sampling with Fixed tau

Now, we will explore advanced Bayesian modeling techniques tailored for GLM gaussian using the catalytic package. Bayesian inference offers a powerful framework to estimate model parameters and quantify uncertainty by integrating prior knowledge with observed data.

Below functions enable Bayesian inference for GLM Linear Regression Model with catalytic priors. This function utilizes Markov chain Monte Carlo (MCMC) methods, implemented using the rstan package, to sample from the posterior distribution of model parameters. Users can specify various options such as the number of MCMC chains (chains), iterations (iter), warmup steps (warmup), and the MCMC algorithm (algorithm). User could also apply other attributes to rstan::sampling, like refresh and control.

In this section, we explore Bayesian approaches using the cat_glm_bayes function from the catalytic package. This function can fit a Bayesian Generalized Linear Model (GLM) using a fixed tau value. The MCMC sampling process will generate posterior distributions for the coefficients based on the specified tau.

Here’s a breakdown of the parameters used:

  • formula, cat_init and tau are same as above.

  • chains: Number of Markov chains to run during MCMC sampling in rstan. Defaults to 4.

  • iter: Total number of iterations per chain in the MCMC sampling process in rstan. Defaults to 2000.

  • warmup: Number of warmup iterations in the MCMC sampling process in rstan, discarded as burn-in. Defaults to 1000.

  • algorithm: Specifies the sampling algorithm used in rstan. Defaults to “NUTS” (No-U-Turn Sampler).

  • gaussian_variance_alpha: The shape parameter for the inverse-gamma prior on variance if the variance is unknown in Gaussian models. Defaults to the number of predictors.

  • gaussian_variance_beta: The scale parameter for the inverse-gamma prior on variance if the variance is unknown in Gaussian models. Defaults to the number of predictors times variance of observation response.

  • ... (ellipsis): Denotes additional arguments that can be passed directly to the underlying rstan::sampling function used within cat_glm_bayes to fit the Bayesian GLM model. These arguments allow for customization of the Bayesian GLM fitting process, such as control, refresh, or other model-specific settings.

For more details, please refer to ?cat_glm_bayes.

cat_glm_bayes_model <- cat_glm_bayes(
  formula = Y ~ ., # Same as `~ .`
  cat_init = cat_init,
  tau = 50, # Default: number of predictors / 4
  chains = 1, # Default: 4
  iter = 100, # Default: 2000
  warmup = 50, # Default: 1000
  algorithm = "NUTS", # Default: NUTS
  gaussian_variance_alpha = 1, # Default: number of predictors
  gaussian_variance_beta = 1 # Default: number of predictors times variance of observation response
)

cat_glm_bayes_model

Here shows how users can simplify the input for cat_glm_bayes. User do not have to specify tau and other attributes, as tau and other attributes have default value, which mentioned above. Here we assign lower value to chains, iter and warmup for quicker processing time.

User can also get the traceplot of the rstan model by using traceplot() directly into the output from cat_glm_bayes.

traceplot(cat_glm_bayes_model)

Plus, user can use this catlaytic::traceplot just like the rstan::traceplot, user can add parameters used in rstan::traceplot, like include and inc_warmup.

traceplot(cat_glm_bayes_model, inc_warmup = TRUE)

Let’s check the prediction error.

cat_glm_bayes_predicted_y <- predict(
  cat_glm_bayes_model,
  newdata = test_data
)

cat(
  "Catalytic cat_glm_bayes - Mean Square Error (Data):",
  mean((cat_glm_bayes_predicted_y - test_data$Y)^2)
)

cat(
  "\nCatalytic cat_glm_bayes - Sum Square Error (Coefficients):",
  sum((coef(cat_glm_bayes_model) - true_coefs)^2)
)

Let us check the scatter plot of the cat_glm_bayes_predicted_y from cat_glm_bayes versus the test_data$Y, this can be a great way to visually assess the accuracy and performance of the model.

plot(test_data$Y, cat_glm_bayes_predicted_y,
  main = "Scatter Plot of true Y vs Predicted Y (cat_glm_bayes)",
  xlab = "true Y",
  ylab = "Predicted Y (cat_glm_bayes)",
  pch = 19,
  col = "blue"
)

# Add a 45-degree line for reference
abline(a = 0, b = 1, col = "red", lwd = 2)

Both cat_glm_bayes_model and cat_glm_bayes_model objects are outputs from the cat_glm_bayes function, providing a list of attributes for further analysis or user inspection.

Here’s a breakdown of all attributes except the input parameters:

  • function_name: The name of the function (cat_glm_bayes).

  • stan_data: The data list used for rstan::sampling.

  • stan_model: The rstan::stan_model object used for Bayesian modeling.

  • stan_sample_model: The result object obtained from rstan::sampling, encapsulating the MCMC sampling results.

  • coefficients: The mean estimated coefficients from the Bayesian GLM model, extracted from rstan::summary(stan_sample_model)$summary.

For more details, please refer to ?cat_glm_bayes.

names(cat_glm_bayes_model)

User can extract items mentioned above from cat_glm_bayes object.

# The number of Markov chains used during MCMC sampling in `rstan`.
cat_glm_bayes_model$chain

# The mean estimated coefficients from the Bayesian GLM model
cat_glm_bayes_model$coefficients

Step 2.4: Choose Method(s) - Bayesian Posterior Sampling with Adaptive Tau

In this section, we delve into Bayesian methodologies employing the cat_glm_bayes_joint function within the catalytic package. Unlike its non-adaptive counterpart (cat_glm_bayes), this method employs a joint tau prior approach where tau is treated as a parameter within the MCMC sampling process, improving the robustness and accuracy of parameter estimation in Bayesian gaussian modeling.

In this section, we explore Bayesian approaches using the cat_glm_bayes_joint function from the catalytic package. These functions are similar to their non-adaptive (non-joint) version cat_glm_bayes, but corporate tau into the MCMC sampling process.

Here’s a breakdown of the parameters used:

  • formula, and cat_init are same as above.

  • chains, iter, warmup, algorithm, gaussian_variance_alpha and gaussian_variance_beta are same in section Bayesian Posterior Sampling with Fixed Tau.

  • tau_alpha: Alpha parameter controlling degrees of freedom for distribution in the joint tau approach. Default is 2.

  • tau_gamma: Gamma parameter in the joint tau approach. Default is 1.

  • binomial_joint_theta: Logical. If TRUE, use theta parameter in the binomial model. Default is FALSE. More explanation in catalytic_glm_binomial.Rmd.

  • binomial_joint_alpha: Logical. If TRUE, use joint alpha in the binomial model. Default is FALSE. More explanation in catalytic_glm_binomial.Rmd.

  • binomial_tau_lower: Lower limit for the tau parameter in the binomial model. Default is 0.05. More explanation in catalytic_glm_binomial.Rmd.

  • ...(ellipsis): Denotes additional arguments that can be passed directly to the underlying rstan::sampling function used within cat_glm_bayes to fit the Bayesian GLM model. These arguments allow for customization of the Bayesian GLM fitting process, such as control, refresh, or other model-specific settings.

For more details, please refer to ?cat_glm_bayes_joint.

cat_glm_bayes_joint_model <- cat_glm_bayes_joint(
  formula = Y ~ ., # Same as `~ .`
  cat_init = cat_init,
  chains = 1, # Default: 4
  iter = 100, # Default: 2000
  warmup = 50, # Default: 1000
  algorithm = "NUTS", # Default: NUTS
  tau_alpha = 2, # Default: 2
  tau_gamma = 1 # Default: 1
)

cat_glm_bayes_joint_model

Here shows how users can simplify the input for cat_glm_bayes_joint. User do not have to specify tau_alpha and other attributes, as tau_alpha can derived from cat_init , while other attributes have default value, which mentioned above. Here we assign lower value to chains, iter and warmup for quicker processing time.

User can also get the traceplot of the rstan model by using traceplot() directly into the output from cat_glm_bayes_joint.

traceplot(cat_glm_bayes_joint_model)

Like the traceplot shown in the cat_glm_bayes function , user can add parameters used in rstan::traceplot, like include and inc_warmup.

traceplot(cat_glm_bayes_joint_model, inc_warmup = TRUE)

Let’s check the prediction error.

cat_glm_bayes_joint_predicted_y <- predict(
  cat_glm_bayes_joint_model,
  newdata = test_data
)

cat(
  "Catalytic `cat_glm_bayes_joint` - Mean Square Error (Data):",
  mean((cat_glm_bayes_joint_predicted_y - test_data$Y)^2)
)

cat(
  "\nCatalytic `cat_glm_bayes_joint` - Sum Square Error (Coefficients):",
  sum((coef(cat_glm_bayes_joint_model) - true_coefs)^2)
)

Let us check the scatter plot of the predicted_y from glm_model versus the test_data$Y, this can be a great way to visually assess the accuracy and performance of the model.

plot(test_data$Y,
  cat_glm_bayes_joint_predicted_y,
  main = "Scatter Plot of true Y vs Predicted Y (cat_glm_bayes_joint)",
  xlab = "true Y",
  ylab = "Predicted Y (cat_glm_bayes_joint)",
  pch = 19,
  col = "blue"
)

# Add a 45-degree line for reference
abline(a = 0, b = 1, col = "red", lwd = 2)

Both cat_glm_bayes_joint_model and cat_glm_bayes_joint_model objects are outputs from the cat_glm_bayes_joint function, providing a list of attributes for further analysis or user inspection.

Here’s a breakdown of all attributes except the input parameters:

  • function_name: The name of the function (cat_glm_bayes_joint).

  • tau: The estimated tau parameter from the MCMC sampling rstan::sampling, depending on the model configuration.

  • stan_data, stan_model, stan_sample_model and coefficients are same in section Bayesian Posterior Sampling with Fixed Tau.

For more details, please refer to ?cat_glm_bayes_joint.

names(cat_glm_bayes_joint_model)

User can extract items mentioned above from cat_glm_bayes_joint object.

# The estimated tau parameter from the MCMC sampling `rstan::sampling`,
cat_glm_bayes_joint_model$tau

# The mean estimated coefficients from the Bayesian GLM model
cat_glm_bayes_joint_model$coefficients

References

Huang, Dongming, Nathan Stein, Donald B. Rubin, and S. C. Kou. 2020. “Catalytic Prior Distributions with Application to Generalized Linear Models.” ResearchGate. https://www.researchgate.net/publication/341417038_Catalytic_prior_distributions_with_application_to_generalized_linear_models.