Title: | Computation and Decomposition of the Mutual Information Index |
---|---|
Description: | The Mutual Information Index (M) introduced to social science literature by Theil and Finizza (1971) <doi:10.1080/0022250X.1971.9989795> is a multigroup segregation measure that is highly decomposable and that according to Frankel and Volij (2011) <doi:10.1016/j.jet.2010.10.008> and Mora and Ruiz-Castillo (2011) <doi:10.1111/j.1467-9531.2011.01237.x> satisfies the Strong Unit Decomposability and Strong Group Decomposability properties. This package allows computing and decomposing the total index value into its "between" and "within" terms. These last terms can also be decomposed into their contributions, either by group or unit characteristics. The factors that produce each "within" term can also be displayed at the user's request. The results can be computed considering a variable or sets of variables that define separate clusters. |
Authors: | Cristian Angulo-Gonzalez [aut, cre], Rafael Fuentealba-Chaura [aut], Ricardo Mora [aut], Julio Rojas-Mora [aut], FONDECYT/ANID Project 11170583 [fnd], MCIN/AEI/10.13039/501100011033 (Project no. PID2019-108576RB-I00) [fnd], UCT VIP Project FEQUIP2019-INRN-03 [fnd] |
Maintainer: | Cristian Angulo-Gonzalez <[email protected]> |
License: | GPL-3 |
Version: | 2.0.3 |
Built: | 2025-01-19 12:56:35 UTC |
Source: | CRAN |
The Mutual Information Index (M) was introduced to the social sciences by Theil and Finizza (1971). The M index is a multigroup segregation measure that is highly decomposable, satisfying both the Strong Unit Decomposability (SUD) and the Strong Group Decomposability (SGD) properties (Frankel and Volij, 2011; Mora and Ruiz-Castillo, 2011).
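For reference, the index is commonly written as the mutual information between the unit and group classifications. The notation below is an assumption chosen for illustration and is not taken from the package documentation: p_{ug} is the share of the population in unit u and group g, p_{u.} and p_{.g} are the marginal shares, and the natural logarithm is assumed. The last two expressions sketch the "between" and "within" split over clusters k, where M^B is the index computed between clusters and groups, M_k is the index inside cluster k, and p_k is the population share of cluster k.

```latex
M = \sum_{u}\sum_{g} p_{ug}\,\log\!\left(\frac{p_{ug}}{p_{u\cdot}\,p_{\cdot g}}\right),
\qquad
M = M^{B} + M^{W},
\qquad
M^{W} = \sum_{k} p_{k}\, M_{k}
```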
The package allows for:
The computation of the M index, either overall or over subsamples defined by the user.
The decomposition of the M index into a "between" and a "within" term.
The identification of the "exclusive contributions" of segregation sources defined either by group or unit characteristics.
The computation of all the elements that make up the "within" term in the decomposition.
Fast computation using more than one CPU core on Mac, Linux, Unix, and BSD systems. This option relies on the data.table and parallel libraries (which cannot run on more than one CPU core on Windows).
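The multi-core restriction comes from process forking, which the parallel library does not support on Windows. The following is a minimal sketch of that pattern in base R, not the package's own code: `pick_cores` and the toy task list are hypothetical names introduced here, and the fallback to one core when nothing is requested is an assumption.

```r
library(parallel)

# Hypothetical helper: choose how many worker processes to use.
# Forking is unavailable on Windows, so fall back to a single core there.
pick_cores <- function(requested = NULL) {
  if (.Platform$OS.type == "windows") return(1L)
  if (is.null(requested)) return(1L)  # assumed default, for illustration only
  available <- detectCores()
  if (is.na(available)) available <- 1L
  as.integer(max(1L, min(requested, available)))
}

# Toy independent tasks standing in for per-subsample computations.
tasks <- list(a = 1:10, b = 1:20, c = 1:30)
results <- mclapply(tasks, sum, mc.cores = pick_cores(2))
```

With `mc.cores = 1`, `mclapply` behaves like `lapply`, so the same call runs on every platform.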
Rafael Fuentealba-Chaura
Cristian Angulo-Gonzalez
Ricardo Mora
Julio Rojas-Mora
Frankel, D. and Volij, O. (2011). Measuring school segregation. Journal of Economic Theory, 146(1):1-38. doi:10.1016/j.jet.2010.10.008.
Guinea-Martin, D., Mora, R., and Ruiz-Castillo, J. (2018). The evolution of gender segregation over the life course. American Sociological Review, 83(5):983-1019. doi:10.1177/0003122418794503.
Mora, R. and Guinea-Martin, D. (2021). Computing decomposable multigroup indexes of segregation. UC3M Working papers, Economics 31803, Universidad Carlos III de Madrid. Departamento de Economía.
Mora, R. and Ruiz-Castillo, J. (2011). Entropy-based segregation indices. Sociological Methodology, 41(1):159-194. doi:10.1111/j.1467-9531.2011.01237.x.
Theil, H. and Finizza, A. J. (1971). A note on the measurement of racial integration of schools by means of informational concepts. The Journal of Mathematical Sociology, 1(2):187-193. doi:10.1080/0022250X.1971.9989795.
The data set included in this package was built from two data sets. The first one is the student enrollment reported by the Ministry of Education (MINEDUC, https://datosabiertos.mineduc.cl/) for students of primary education (first eight years of formal education) who attended establishments officially recognized by the State. The second one comprises the Quality and Context of Education Questionnaire for Parents and Guardians and the Student Questionnaire, both applied by the Education Quality Agency (https://www.agenciaeducacion.cl/) to all students in grades 4 and 8 of primary education. Both sources are limited to the period 2016-2018. The data set contains information on student and educational system characteristics in southern Chile (Biobio, La Araucania, and Los Rios regions).
DF_Seg_Chile
A data.frame with 191495 observations and 11 variables:
Student enrollment year. From 2016 to 2018.
School ID (RBD, Rol de Base de Datos).
Administrative district where the school is located.
Preferential Scholar Subsidy Category (from the Spanish Categoría de Subvención Escolar Preferencial). Students belong to either the non-subsidized, the partially-subsidized, or the subsidized group according to Act 20.248 on the Preferential Scholar Subsidy (SEP).
Self-reported Mapuche ethnicity. Students either belong to the Mapuche ethnicity or do not.
School with multiage classrooms. The school is located in an urban zone or not.
Administrative region where the school is located. Schools belong to either the Biobio, La Araucania, or Los Rios region.
Whether the school is public, charter, or private.
Student gender code. Students can either be female or male.
Student grade. Students can either belong to the 4th (4) or 8th (8) grade of basic school.
Number of students in a cell or combination of variables.
Ministry of Education (MINEDUC): https://datosabiertos.mineduc.cl/
Education Quality Agency: https://www.agenciaeducacion.cl/
The data set included in this package was built from two data sets. The first one is the student enrollment reported by the Ministry of Education (MINEDUC, https://datosabiertos.mineduc.cl/) for students of primary education (first eight years of formal education) who attended establishments officially recognized by the State. The second one comprises the Quality and Context of Education Questionnaire for Parents and Guardians and the Student Questionnaire, both applied by the Education Quality Agency (https://www.agenciaeducacion.cl/) to all students in grades 4 and 8 of primary education. Both sources are limited to the period 2016-2018. The data set contains information on student and educational system characteristics in southern Chile (Biobio, La Araucania, and Los Rios regions).
DT_Seg_Chile
A data.table with 55960 observations and 11 variables:
Student enrollment year. From 2016 to 2018.
School ID (RBD, Rol de Base de Datos).
Administrative district where the school is located.
Preferential Scholar Subsidy Category (from the Spanish Categoría de Subvención Escolar Preferencial). Students belong to either the non-subsidized, the partially-subsidized, or the subsidized group according to Act 20.248 on the Preferential Scholar Subsidy (SEP).
Self-reported Mapuche ethnicity. Students either belong to the Mapuche ethnicity or do not.
School with multiage classrooms. The school is located in an urban zone or not.
Administrative region where the school is located. Schools belong to either the Biobio, La Araucania, or Los Rios region.
Whether the school is public, charter, or private.
Student gender code. Students can either be female or male.
Student grade. Students can either belong to the 4th (4) or 8th (8) grade of basic school.
Number of students in a cell or combination of variables.
Ministry of Education (MINEDUC): https://datosabiertos.mineduc.cl/
Education Quality Agency: https://www.agenciaeducacion.cl/
The data set included in this package was built from two data sets. The first one is the student enrollment reported by the Ministry of Education (MINEDUC, https://datosabiertos.mineduc.cl/) for students of primary education (first eight years of formal education) who attended establishments officially recognized by the State. The second one comprises the Quality and Context of Education Questionnaire for Parents and Guardians and the Student Questionnaire, both applied by the Education Quality Agency (https://www.agenciaeducacion.cl/) to all students in grades 4 and 8 of primary education. Both sources are limited to 2018. The data set contains information on student and educational system characteristics in southern Chile (Biobio, La Araucania, and Los Rios regions).
DT_test
A data.table with 6703 observations and 5 variables, only for testing purposes:
School ID (RBD, Rol de Base de Datos).
Preferential Scholar Subsidy Category (from the Spanish Categoría de Subvención Escolar Preferencial). Students belong to either the non-subsidized, the partially-subsidized, or the subsidized group according to Act 20.248 on the Preferential Scholar Subsidy (SEP).
Self-reported Mapuche ethnicity. Students either belong to the Mapuche ethnicity or do not.
Administrative region where the school is located. Schools belong to either the Biobio, La Araucania, or Los Rios region.
Number of students in a cell or combination of variables.
Ministry of Education (MINEDUC): https://datosabiertos.mineduc.cl/
Education Quality Agency: https://www.agenciaeducacion.cl/
Computes and decomposes the Mutual Information index into "between" and "within" terms. The "within" terms can also be decomposed into "exclusive contributions" of segregation sources defined either by group or unit characteristics. The mathematical components required to compute each "within" term can also be displayed at the user's request. The results can be computed over subsamples defined by the user.
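To make the "between" and "within" terms concrete, the following base R sketch computes the index by hand on a made-up frequency table and checks that the two terms add up to the total. It is an illustration under stated assumptions, not the package's implementation: the data, the helper `m_index`, and the use of the natural logarithm are all hypothetical choices, and schools are assumed to be nested within regions.

```r
# Toy frequency table: one row per (region, school, group) cell with a count.
# All names and numbers are made up for illustration.
cells <- data.frame(
  region = c("A", "A", "A", "A", "B", "B", "B", "B"),
  school = c("s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"),
  group  = c("x", "y", "x", "y", "x", "y", "x", "y"),
  n      = c(30, 10, 10, 30, 20, 20, 5, 35)
)

# Hypothetical helper: mutual information (natural log) between two
# categorical variables, computed from cell counts. Empty cells contribute zero.
m_index <- function(row_var, col_var, n) {
  joint <- tapply(n, list(row_var, col_var), sum)
  joint[is.na(joint)] <- 0
  p <- joint / sum(joint)
  expected <- outer(rowSums(p), colSums(p))
  sum(p[p > 0] * log(p[p > 0] / expected[p > 0]))
}

# Overall segregation of groups across schools.
m_total <- m_index(cells$school, cells$group, cells$n)

# "Between" term: segregation of groups across regions.
m_between <- m_index(cells$region, cells$group, cells$n)

# "Within" term: population-weighted average of the index inside each region.
by_region <- split(cells, cells$region)
m_k <- sapply(by_region, function(d) m_index(d$school, d$group, d$n))
w_k <- sapply(by_region, function(d) sum(d$n)) / sum(cells$n)
m_within <- sum(w_k * m_k)

# The two terms add up to the overall index (up to floating-point error).
all.equal(m_total, m_between + m_within)
```

The identity holds here because each school belongs to exactly one region; with non-nested variables the overall index would have to be defined over the combination of both.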
    mutual(
      data,
      group,
      unit,
      within = NULL,
      by = NULL,
      contribution.from = NULL,
      components = FALSE,
      cores = NULL
    )
data |
An object from the "data.table" and "mutual.data" classes. |
group |
A categorical variable name or vector of categorical variable names contained in |
unit |
A categorical variable name or vector of categorical variable names contained in |
within |
A categorical variable name or vector of categorical variable names contained in |
by |
A categorical variable name or vector of categorical variable names contained in |
contribution.from |
A character value that can be 'group_vars' or 'unit_vars', or a categorical variable name or vector of categorical variable names contained in the |
components |
A boolean value. If TRUE and the |
cores |
A positive integer. Defines the number of CPU cores to use in parallelization tasks. If |
Mixing group variables with unit variables in contribution.from will produce an error.
A data.table if the components option is FALSE; a list if the components option is TRUE, the within option is not NULL, and the by option is NULL; or a list of lists if the components option is TRUE and both the within and by options are not NULL.
Frankel, D. and Volij, O. (2011). Measuring school segregation. Journal of Economic Theory, 146(1):1-38. doi:10.1016/j.jet.2010.10.008.
Guinea-Martin, D., Mora, R., and Ruiz-Castillo, J. (2018). The evolution of gender segregation over the life course. American Sociological Review, 83(5):983-1019. doi:10.1177/0003122418794503.
Mora, R. and Guinea-Martin, D. (2021). Computing decomposable multigroup indexes of segregation. UC3M Working papers, Economics 31803. Universidad Carlos III de Madrid. Departamento de Economía.
Mora, R. and Ruiz-Castillo, J. (2011). Entropy-based segregation indices. Sociological Methodology, 41(1):159-194. doi:10.1111/j.1467-9531.2011.01237.x.
Theil, H. and Finizza, A. J. (1971). A note on the measurement of racial integration of schools by means of informational concepts. The Journal of Mathematical Sociology, 1(2):187-193. doi:10.1080/0022250X.1971.9989795.
    # To compute the overall measure of school segregation by socioeconomic and ethnic status.
    mutual(data = DT_test, group = c("csep", "ethnicity"), unit = "school")

    # Computation of the exclusive effect of specific segregation sources on the overall measure, e.g.,
    # socioeconomic and ethnic contributions, and the contribution that cannot be attributed to any of
    # them (the "interaction" term).
    mutual(data = DT_test, group = c("csep", "ethnicity"), unit = "school", by = "region",
           contribution.from = "group_vars")

    # For more information on the package, refer to the manual and the README file.
Receives the data that is later used in the mutual function. Generates a data.table with the entry variables.
prepare_data(data, vars, fw = NULL, col.order = NULL)
data |
A tabular format object ( |
vars |
A vector of variable names or vector of column numbers contained in |
fw |
Variable name or column number contained in |
col.order |
A variable name or vector of variable names contained in |
Returns a data.table of class "data.table", "data.frame", and "mutual.data".
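The idea behind a frequency weights column can be sketched in base R: individual records are collapsed into one row per combination of variables, with a count attached. This is only an illustration of the concept with made-up data and column names; whether prepare_data collapses rows internally in this way is not stated here.

```r
# Made-up individual-level records: one row per student.
students <- data.frame(
  school    = c("s1", "s1", "s1", "s2", "s2"),
  ethnicity = c("m", "m", "n", "n", "n")
)

# Give every record a weight of 1, then sum the weights within each
# combination of the selected variables.
students$fw <- 1
cells <- aggregate(fw ~ school + ethnicity, data = students, FUN = sum)

# Sort the cells so the output order does not depend on aggregate().
cells <- cells[order(cells$school, cells$ethnicity), ]
```

Each row of `cells` now describes one school and ethnicity combination, and the `fw` column records how many students fall in it.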
    # Using some variable names in 'data' with explicit 'fw'.
    my_data <- prepare_data(data = DF_Seg_Chile,
                            vars = c("csep", "ethnicity", "school", "district"),
                            fw = "nobs")

    # Using some column numbers in 'data' and explicit 'fw' as another column number.
    my_data <- prepare_data(data = DF_Seg_Chile, vars = c(4, 5, 2, 3), fw = 11)

    # Using all variables of 'data' with explicit 'fw'.
    my_data <- prepare_data(data = DF_Seg_Chile, vars = "all_vars", fw = "nobs")

    # Using some variable names in 'data' and 'fw' does not exist (in this case, the new 'fw' will
    # be equal to 1 for all variable combinations as 'data' already has a frequency weights variable)
    my_data <- prepare_data(data = DF_Seg_Chile,
                            vars = c("csep", "ethnicity", "school", "district"))

    # Using the 'col.order' option to sort data according to the 'csep' column.
    my_data <- prepare_data(data = DF_Seg_Chile,
                            vars = c("csep", "ethnicity", "school", "district"),
                            fw = "nobs", col.order = "csep")

    # The class of the resulting object in all cases must be "data.table", "data.frame" and
    # "mutual.data".
    class(my_data)