Variable Selection in Data Envelopment Analysis with ADEA Method

Fernando Fernandez-Palacin [1]

Manuel Munoz-Marquez [2]

2024-11-12

Introduction

Variable selection in Data Envelopment Analysis (DEA) deserves careful attention before the results of an analysis are applied in a real-world context, because the outcomes can vary significantly depending on the variables included in the model. It is therefore a fundamental step in every DEA application.

ADEA provides a measure known as “load” to assess the contribution of a variable to a DEA model. In an ideal scenario where all variables contribute equally, all loads would be equal to 1. For instance, if the load of an output variable is 0.75, it signifies that its contribution is 75% of the average value for all outputs. A load value below 0.6 indicates that the variable’s contribution to the DEA model is negligible.
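
As a rough illustration of how a load is read, the base R sketch below uses made-up loads for two hypothetical outputs (the names `Out.A` and `Out.B` and their values are assumptions, not results produced by the package) and applies the 0.6 cut-off mentioned above.

```r
# Hypothetical loads for two output variables (illustrative values only).
output_loads <- c(Out.A = 0.75, Out.B = 1.25)

# A load is a contribution relative to the average contribution of the
# variables of the same type, so the loads of one type average to 1.
mean(output_loads)

# Contribution of each variable as a percentage of that average.
percent_of_average <- 100 * output_loads
percent_of_average

# Names of the variables whose load falls below the 0.6 cut-off.
negligible <- names(output_loads)[output_loads < 0.6]
negligible
```

Here neither hypothetical output falls below the cut-off, so neither would be a candidate for removal.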

For more information about loads, see the package help, Fernandez-Palacin, Lopez-Sanchez, and Munoz-Marquez (2018), and Villanueva-Cantillo and Munoz-Marquez (2021).

Let’s load and inspect the “tokyo_libraries” dataset using the following code:

library(adea)
data(tokyo_libraries)
head(tokyo_libraries)
#>   Area.I1 Books.I2 Staff.I3 Populations.I4 Regist.O1 Borrow.O2
#> 1   2.249  163.523       26         49.196     5.561   105.321
#> 2   4.617  338.671       30         78.599    18.106   314.682
#> 3   3.873  281.655       51        176.381    16.498   542.349
#> 4   5.541  400.993       78        189.397    30.810   847.872
#> 5  11.381  363.116       69        192.235    57.279   758.704
#> 6  10.086  541.658      114        194.091    66.137  1438.746

Stepwise variable selection

Two stepwise variable selection functions are provided. The first one, adea_hierarchical, eliminates variables one by one, creating a set of nested models. The following code sets up the input and output variables and performs the call:

input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]
adea_hierarchical(input, output)
#>       Loads nEfficients nVariables nInputs nOutputs
#> 6 0.4554670           6          6       4        2
#> 5 0.9901640           6          5       3        2
#> 4 0.8533008           3          4       2        2
#> 3 0.6574467           2          3       1        2
#> 2 1.0000000           1          2       1        1
#>                                        Inputs              Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5          Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 4                          Books.I2, Staff.I3 Regist.O1, Borrow.O2
#> 3                                    Books.I2 Regist.O1, Borrow.O2
#> 2                                    Books.I2            Borrow.O2

The load of the first model (0.455) falls below the 0.6 threshold, indicating that Area.I1 can be removed from the model.

When a variable is removed, one would expect the loads of the remaining variables to increase. However, this does not happen after the second model: the model load drops from 0.990 with five variables to 0.853 with four. The third model is therefore inferior to the second, and there is no statistical rationale for selecting it.

To avoid this, a second stepwise selection function, adea_parametric, is provided. The call is as follows:

adea_parametric(input, output)
#>      Loads nEfficients nVariables nInputs nOutputs
#> 6 0.455467           6          6       4        2
#> 5 0.990164           6          5       3        2
#> 2 1.000000           1          2       1        1
#>                                        Inputs              Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5          Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 2                                    Books.I2            Borrow.O2
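
One way to read the two tables side by side is sketched below in base R. It copies the model loads printed by adea_hierarchical and keeps only the models whose load improves on every larger model. This is only a reading aid applied to the printed numbers, not a description of how adea_parametric works internally; the variable names `loads`, `best_before` and `improving` are introduced here for illustration.

```r
# Model loads copied from the adea_hierarchical output above,
# named by the number of variables in each model.
loads <- c("6" = 0.4554670, "5" = 0.9901640, "4" = 0.8533008,
           "3" = 0.6574467, "2" = 1.0000000)

# Highest load seen among the larger models that precede each one.
best_before <- c(-Inf, head(unname(cummax(loads)), -1))

# Keep a model only if its load improves on every larger model.
improving <- names(loads)[loads > best_before]
improving
```

On this dataset the filter retains the models with 6, 5 and 2 variables, which are the rows listed by adea_parametric above.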

In both cases, all variables are considered for removal by default, but the load.orientation parameter allows for selecting which variables to include in the load analysis. It accepts 'input' for input variables only, 'output' for output variables only, or 'inoutput' (the default) for all variables. The following call considers only output variables as candidates for removal:

adea_parametric(input, output, load.orientation = 'output')
#>   Loads nEfficients nVariables nInputs nOutputs
#> 6     1           6          6       4        2
#>                                        Inputs              Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2

Both adea_hierarchical and adea_parametric return an object whose models element is a list containing all computed models. A single model can be accessed as follows:

m <- adea_hierarchical(input, output)
m4 <- m$models[[4]]
m4
#>  [1] 0.2260062 0.6377375 0.5400548 0.5930209 0.9112849 0.7449643 0.6496709
#>  [8] 0.5391304 0.8966427 0.7051438 0.5387076 0.7191553 0.6381740 0.7152620
#> [15] 0.8440736 0.5822710 1.0000000 0.7867065 1.0000000 0.8485716 0.7872304
#> [22] 0.6806063 1.0000000

where the index in double square brackets, 4 in this call, is the total number of variables in the model.

By default, when the print function is used with an ADEA model, it displays only efficiencies. The summary function provides a more comprehensive output:

summary(m4)
#>                                          
#> Model name                               
#> Orientation                         input
#> Load orientation                 inoutput
#> Model load              0.853300754553448
#> Input load.Books.I2      1.14669924544655
#> Input load.Staff.I3     0.853300754553448
#> Output load.Regist.O1   0.853300754553448
#> Output load.Borrow.O2    1.14669924544655
#> Inputs                  Books.I2 Staff.I3
#> Outputs               Regist.O1 Borrow.O2
#> nInputs                                 2
#> nOutputs                                2
#> nVariables                              4
#> nEfficients                             3
#> Eff. Mean               0.721061510571196
#> Eff. sd                  0.18362174896548
#> Eff. Min.               0.226006153331096
#> Eff. 1st Qu.            0.615379209850103
#> Eff. Median             0.715262000660375
#> Eff. 3rd Qu.            0.846322575997714
#> Eff. Max.                               1
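
The loads in this summary can be checked by hand. The base R sketch below copies the four variable loads printed above and verifies two things that hold in this output: the loads of each type average to 1, and the reported model load coincides with the smallest variable load. Treating that coincidence as the general rule is an assumption here; see the references for the formal definition.

```r
# Loads copied from the summary(m4) output above.
input_loads  <- c(Books.I2 = 1.14669924544655, Staff.I3 = 0.853300754553448)
output_loads <- c(Regist.O1 = 0.853300754553448, Borrow.O2 = 1.14669924544655)

# Within each type the loads average to 1.
mean(input_loads)
mean(output_loads)

# The reported model load coincides with the smallest variable load.
model_load <- min(input_loads, output_loads)
model_load

# TRUE when no variable falls below the 0.6 cut-off.
all_relevant <- all(c(input_loads, output_loads) >= 0.6)
all_relevant
```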

References

Fernandez-Palacin, Fernando, María Auxiliadora Lopez-Sanchez, and Manuel Munoz-Marquez. 2018. “Stepwise Selection of Variables in DEA Using Contribution Loads.” Pesquisa Operacional 38 (1): 31–52. http://dx.doi.org/10.1590/0101-7438.2018.038.01.0031.
Villanueva-Cantillo, Jeyms, and Manuel Munoz-Marquez. 2021. “Methodology for Calculating Critical Values of Relevance Measures in Variable Selection Methods in Data Envelopment Analysis.” European Journal of Operational Research 290 (2): 657–70. https://doi.org/10.1016/j.ejor.2020.08.021.

  1. Universidad de Cádiz

  2. Universidad de Cádiz