Beta-Select Demonstration: SEM by ‘lavaan’

Introduction

This article demonstrates how to use lav_betaselect() from the package betaselectr to standardize selected variables in a model fitted by lavaan and forming confidence intervals for the parameters.

Data and Model

The sample dataset from the package betaselectr will be used in this demonstration:

library(betaselectr)
head(data_test_medmod)
#>          dv       iv      mod      med     cov1     cov2
#> 1  7.487873 11.42573 16.65805 42.28988 54.14051 15.56069
#> 2  8.474931 16.64790 22.66332 42.08692 39.21125 17.61286
#> 3 11.206539 14.81278 22.80955 32.76869 31.97963 20.77333
#> 4 10.148827 15.79632 22.94451 43.96807 42.72187 15.66971
#> 5  7.421606 14.29621 24.51562 37.10942 42.74174 21.97132
#> 6  6.846435 12.00819 25.22163 35.46051 30.85914 22.35444

This is the path model, fitted by lavaan::sem():

library(lavaan)
#> This is lavaan 0.6-19
#> lavaan is FREE software! Please report any bugs.
mod <-
"
med ~ iv + mod + iv:mod + cov1 + cov2
dv ~ med + iv + cov1 + cov2
"
fit <- sem(mod,
           data_test_medmod)

The model has a moderator, mod, posited to moderate the effect from iv to med. The product term is iv:mod.

These are the results:

summary(fit)
#> lavaan 0.6-19 ended normally after 2 iterations
#> 
#>   Estimator                                         ML
#>   Optimization method                           NLMINB
#>   Number of model parameters                        11
#> 
#>   Number of observations                           200
#> 
#> Model Test User Model:
#>                                                       
#>   Test statistic                                 2.303
#>   Degrees of freedom                                 2
#>   P-value (Chi-square)                           0.316
#> 
#> Parameter Estimates:
#> 
#>   Standard errors                             Standard
#>   Information                                 Expected
#>   Information saturated (h1) model          Structured
#> 
#> Regressions:
#>                    Estimate   Std.Err  z-value  P(>|z|)
#>   med ~                                                
#>     iv                -6.373    0.985   -6.473    0.000
#>     mod               -3.899    0.614   -6.346    0.000
#>     iv:mod             0.286    0.039    7.340    0.000
#>     cov1              -0.093    0.070   -1.327    0.185
#>     cov2               0.242    0.133    1.823    0.068
#>   dv ~                                                 
#>     med                0.092    0.011    8.098    0.000
#>     iv                 0.227    0.038    5.896    0.000
#>     cov1              -0.006    0.013   -0.454    0.650
#>     cov2               0.030    0.025    1.230    0.219
#> 
#> Variances:
#>                    Estimate   Std.Err  z-value  P(>|z|)
#>    .med               60.292    6.029   10.000    0.000
#>    .dv                 2.087    0.209   10.000    0.000

Problems With Standardization

We can request the standardized solution using lavaan::standardizedSolution():

standardizedSolution(fit,
                     output = "text")
#> 
#> Regressions:
#>                     est.std  Std.Err  z-value  P(>|z|) ci.lower ci.upper
#>   med ~                                                                 
#>     iv               -1.855    0.259   -7.158    0.000   -2.363   -1.347
#>     mod              -1.956    0.280   -6.988    0.000   -2.504   -1.407
#>     iv:mod            3.588    0.428    8.390    0.000    2.750    4.426
#>     cov1             -0.077    0.058   -1.332    0.183   -0.189    0.036
#>     cov2              0.105    0.057    1.836    0.066   -0.007    0.218
#>   dv ~                                                                  
#>     med               0.459    0.052    8.845    0.000    0.357    0.560
#>     iv                0.331    0.053    6.279    0.000    0.228    0.434
#>     cov1             -0.024    0.054   -0.454    0.650   -0.130    0.081
#>     cov2              0.066    0.054    1.233    0.218   -0.039    0.171
#> 
#> Variances:
#>                     est.std  Std.Err  z-value  P(>|z|) ci.lower ci.upper
#>    .med               0.656    0.050   13.243    0.000    0.559    0.753
#>    .dv                0.569    0.050   11.353    0.000    0.471    0.667

However, for this model, there are several problems:

  • The product term, iv:mod, is also standardized. This is inappropriate. One simple but underused solution is to standardize the variables before forming the product term (Friedrich, 1982).

  • The confidence intervals are formed using the delta-method, which has been found to be inferior to methods such as nonparametric percentile bootstrap confidence interval for the standardized solution (Falk, 2018). Although there are situations in which the delta-method confidence and the nonparametric percentile bootstrap confidences can be similar (e.g., sample size is large and the sample estimates are not extreme), it is still safe to at least try both methods and compare the results.

  • There are cases in which some variables are measured by meaningful units and do not need to be standardized. for example, if cov1 is age measured by year, then age is more meaningful than “standardized age”.

  • In path analysis, categorical variables are usually represented by dummy variables, each of them having only two possible values (0 or 1). It is not meaningful to standardize the dummy variables.

Beta-Select lav_betaselect()

The function lav_betaselect() can be used to solve this problem by:

  • standardizing variables before product terms are formed,

  • standardizing only variables for which standardization can facilitate interpretation, and

  • forming confidence intervals that take into account selected standardization.

We call the coefficients computed by this kind of standardization betas-select (βsSelect, βSelect in singular form), to differentiate them from coefficients computed by standardizing all variables, including product terms.

Estimates Only

Suppose we only need to solve the first problem, with the product term computed after iv and mod are standardized:

fit_beta <- lav_betaselect(fit)
fit_beta

This is the output if printed using the default options:

#> 
#> Selected Standardization:
#>                     
#>  Standard Error: Nil
#> 
#> Parameter Estimates Settings:
#>                                              
#>  Standard errors:                  Standard  
#>  Information:                      Expected  
#>  Information saturated (h1) model: Structured
#> 
#> Regressions:
#>            BetaSelect
#>  med ~               
#>   iv           -1.855
#>   mod          -1.956
#>   iv:mod        0.400
#>   cov1         -0.077
#>   cov2          0.105
#>  dv ~                
#>   med           0.459
#>   iv            0.331
#>   cov1         -0.024
#>   cov2          0.066
#> 
#> Footnote:
#> - Variable(s) standardized: cov1, cov2, dv, iv, med, mod
#> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both
#>   original estimates and betas-select.
#> - Product terms (iv:mod) have variables standardized before computing
#>   them. The product term(s) is/are not standardized.

Compared to the solution with the product term standardized, the coefficient of iv:mod changed substantially from 3.588 to 0.286. As shown by Cheung et al. (2022), the coefficient of standardized product term (iv:mod) can be substantially different from the properly standardized product term (the product of standardized iv and standardized mod).

The footnote will also indicate variables that are standardized, and remarked that product terms are formed after standardization.

Estimates and Bootstrap Confidence Interval

Suppose we want to address both the first and the second problems, with

  • the product term computed after iv and mod standardized, and

  • bootstrap confidence intervals used, that take into account the sampling variation of the standardizers (the standard deviations).

We can call lav_betaselect() again, with additional arguments set:

fit_beta <- lav_betaselect(fit,
                           std_se = "bootstrap",
                           bootstrap = 5000,
                           iseed = 2345,
                           parallel = "snow",
                           ncpus = 20)
#> Finding product terms in the model ...
#> Finished finding product terms.
#> 
#> Compute bootstrapping standardized solution:

These are the additional arguments:

  • std_se: The method to compute the standard errors as well as confidence intervals. Set to "bootstrap" for nonparametric bootstrapping.

  • iseed: The seed for the random number generator used for bootstrapping. Set this to an integer to make the results reproducible.

  • parallel: The method to be used for parallel processing. It will be passed to lavaan::bootstrapLavaan(). Supported values are "none", "snow", and "multicore".

  • ncpus: The number of CPU cores to use if parallel processing is not "none". Default is parallel::detectCores(logical = FALSE) - 1, or the number of physical cores minus one.

This is the output if printed with default options:

fit_beta
#> 
#> Selected Standardization:
#>                                              
#>  Standard Error:      Nonparametric bootstrap
#>  Bootstrap samples:   5000                   
#>  Confidence Interval: Percentile             
#>  Level of Confidence: 95.0%                  
#> 
#> Parameter Estimates Settings:
#>                                              
#>  Standard errors:                  Standard  
#>  Information:                      Expected  
#>  Information saturated (h1) model: Structured
#> 
#> Regressions:
#>            BetaSelect    SE      Z p-value Sig  CI.Lo  CI.Hi CI.Sig
#>  med ~                                                             
#>   iv           -1.855 0.248 -7.490   0.000 *** -2.307 -1.332   Sig.
#>   mod          -1.956 0.281 -6.950   0.000 *** -2.453 -1.348   Sig.
#>   iv:mod        0.400 0.047  8.565   0.000 ***  0.298  0.481   Sig.
#>   cov1         -0.077 0.057 -1.353   0.185     -0.186  0.038   n.s.
#>   cov2          0.105 0.061  1.725   0.094   . -0.019  0.219   n.s.
#>  dv ~                                                              
#>   med           0.459 0.052  8.828   0.000 ***  0.348  0.553   Sig.
#>   iv            0.331 0.051  6.480   0.000 ***  0.229  0.431   Sig.
#>   cov1         -0.024 0.058 -0.418   0.686     -0.137  0.090   n.s.
#>   cov2          0.066 0.058  1.139   0.259     -0.050  0.178   n.s.
#> 
#> Footnote:
#> - Variable(s) standardized: cov1, cov2, dv, iv, med, mod
#> - Sig codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> - Standard errors, p-values, and confidence intervals are not computed
#>   for betas-select which are fixed in the standardized solution.
#> - P-values for betas-select are asymmetric bootstrap p-value computed
#>   by the method of Asparouhov and Muthén (2021).
#> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both
#>   original estimates and betas-select.
#> - Product terms (iv:mod) have variables standardized before computing
#>   them. The product term(s) is/are not standardized.

In this dataset, with 200 cases, the delta-method confidence intervals are close to the bootstrap confidence intervals, except obviously for the product term because the coefficient of the product term has substantially different values in the two solutions.

Estimates and Bootstrap Confidence Intervals, With Only Selected Variables Standardized

Suppose we want to address also the the third issue, and standardize only some of the variables. This can be done using either to_standardize or not_to_standardize.

  • Use to_standardize when the number of variables to standardize is much fewer than the number of variables not to standardize.

  • Use not_to_standardize when the number variables to standardize is much more than the the number of variables not to standardize.

For example, suppose we only need to standardize dv and iv, cov1, and cov2, this is the call to do this, setting to_standardize to c("iv", "dv", "cov1", "cov2"):

fit_beta_select_1 <- lav_betaselect(fit,
                                    std_se = "bootstrap",
                                    to_standardize = c("iv", "dv", "cov1", "cov2"),
                                    bootstrap = 5000,
                                    iseed = 2345,
                                    parallel = "snow",
                                    ncpus = 20)

If we want to standardize all variables except for dv and mod, we can use this call, and set not_to_standardize to c("mod", "dv"):

fit_beta_select_2 <- lav_betaselect(fit,
                                    std_se = "bootstrap",
                                    not_to_standardize = c("mod", "dv"),
                                    bootstrap = 5000,
                                    iseed = 2345,
                                    parallel = "snow",
                                    ncpus = 20)

The results of these calls are identical, and only those of the second version are printed:

fit_beta_select_2
#> Selected Standardization:
#>                                              
#>  Standard Error:      Nonparametric bootstrap
#>  Bootstrap samples:   5000                   
#>  Confidence Interval: Percentile             
#>  Level of Confidence: 95.0%                  
#> 
#> Parameter Estimates Settings:
#>                                              
#>  Standard errors:                  Standard  
#>  Information:                      Expected  
#>  Information saturated (h1) model: Structured
#> 
#> Regressions:
#>            BetaSelect    SE      Z p-value Sig  CI.Lo  CI.Hi CI.Sig
#>  med ~                                                             
#>   iv           -1.855 0.248 -7.490   0.000 *** -2.307 -1.332   Sig.
#>   mod          -0.407 0.059 -6.950   0.000 *** -0.510 -0.280   Sig.
#>   iv:mod        0.083 0.010  8.565   0.000 ***  0.062  0.100   Sig.
#>   cov1         -0.077 0.057 -1.353   0.185     -0.186  0.038   n.s.
#>   cov2          0.105 0.061  1.725   0.094   . -0.019  0.219   n.s.
#>  dv ~                                                              
#>   med           0.878 0.116  7.567   0.000 ***  0.635  1.092   Sig.
#>   iv            0.634 0.100  6.337   0.000 ***  0.430  0.826   Sig.
#>   cov1         -0.047 0.112 -0.418   0.686     -0.265  0.168   n.s.
#>   cov2          0.126 0.111  1.137   0.259     -0.093  0.345   n.s.
#> 
#> Footnote:
#> - Variable(s) standardized: cov1, cov2, iv, med
#> - Sig codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> - Standard errors, p-values, and confidence intervals are not computed
#>   for betas-select which are fixed in the standardized solution.
#> - P-values for betas-select are asymmetric bootstrap p-value computed
#>   by the method of Asparouhov and Muthén (2021).
#> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both
#>   original estimates and betas-select.
#> - Product terms (iv:mod) have variables standardized before computing
#>   them. The product term(s) is/are not standardized.

The footnotes show that, by specifying that dv and mod are not standardized, all the other four variables are standardized: iv, med, cov1, and cov2. Therefore, in this case, it is more convenient to use not_to_standardize.

When reporting betas-select, researchers need to state which variables are standardized and which are not. This can be done in table notes, or in a column of the parameter estimate tables. The output can of lav_betaselect() can be printed with show_Bs.by set to TRUE to demonstrate the second approach:

print(fit_beta_select_2,
      show_Bs.by = TRUE)
#> Regressions:
#>            BetaSelect    SE      Z p-value Sig  CI.Lo  CI.Hi CI.Sig  Selected
#>  med ~                                                                       
#>   iv           -1.855 0.248 -7.490   0.000 *** -2.307 -1.332   Sig.    iv,med
#>   mod          -0.407 0.059 -6.950   0.000 *** -0.510 -0.280   Sig.       med
#>   iv:mod        0.083 0.010  8.565   0.000 ***  0.062  0.100   Sig.    iv,med
#>   cov1         -0.077 0.057 -1.353   0.185     -0.186  0.038   n.s.  cov1,med
#>   cov2          0.105 0.061  1.725   0.094   . -0.019  0.219   n.s.  cov2,med
#>  dv ~                                                                        
#>   med           0.878 0.116  7.567   0.000 ***  0.635  1.092   Sig.       med
#>   iv            0.634 0.100  6.337   0.000 ***  0.430  0.826   Sig.        iv
#>   cov1         -0.047 0.112 -0.418   0.686     -0.265  0.168   n.s.      cov1
#>   cov2          0.126 0.111  1.137   0.259     -0.093  0.345   n.s.      cov2
#> 
#> Footnote:
#> - Variable(s) standardized: cov1, cov2, iv, med
#> - Sig codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> - Standard errors, p-values, and confidence intervals are not computed
#>   for betas-select which are fixed in the standardized solution.
#> - P-values for betas-select are asymmetric bootstrap p-value computed
#>   by the method of Asparouhov and Muthén (2021).
#> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both
#>   original estimates and betas-select.
#> - The column 'Selected' lists variable(s) standardized when computing
#>   the standardized coefficient of a parameter. ('NA' for user-defined
#>   parameters because they are computed from other standardized
#>   parameters.)
#> - Product terms (iv:mod) have variables standardized before computing
#>   them. The product term(s) is/are not standardized.

Categorical Variables

When calling lav_betaselect(), variables with only two values in the dataset are assumed to be categorical and will not be standardized by default. This can be overriden by setting skip_categorical_x to FALSE, though not recommended.

Conclusion

In structural equation modeling, there are situations in which standardizing all variables is not appropriate, or when standardization needs to be done before forming product terms. We are not aware of tools that can do appropriate standardization and form confidence intervals that takes into account the selective standardization. By promoting the use of betas-select using lav_betaselect(), we hope to make it easier for researchers to do appropriate Standardization in when reporting SEM results.

References

Cheung, S. F., Cheung, S.-H., Lau, E. Y. Y., Hui, C. H., & Vong, W. N. (2022). Improving an old way to measure moderation effect in standardized units. Health Psychology, 41(7), 502–505. https://doi.org/10.1037/hea0001188
Falk, C. F. (2018). Are robust standard errors the best approach for interval estimation with nonnormal data in structural equation modeling? Structural Equation Modeling: A Multidisciplinary Journal, 25(2), 244–266. https://doi.org/10.1080/10705511.2017.1367254
Friedrich, R. J. (1982). In defense of multiplicative terms in multiple regression equations. American Journal of Political Science, 26(4), 797–833. https://doi.org/10.2307/2110973