| Title: | Statistical Matching Using Latent Class Models |
|---|---|
| Description: | Tools for statistical matching based on latent class models. The package supports data integration when no unique identifiers are available by modeling the joint distribution of variables through latent categorical structures. It covers estimation of latent class models, probabilistic matching between donor and recipient data sets, and generation of synthetic linked data under uncertainty. It is particularly useful in survey research and data fusion applications that combine information from multiple sources while preserving statistical properties and accounting for measurement error and missing data mechanisms. |
| Authors: | Alicja Wolny-Dominiak [aut, cre], Israa Lewaaelhamd [aut], Mohammed Ali Ismail [aut] |
| Maintainer: | Alicja Wolny-Dominiak <[email protected]> |
| License: | GPL-3 |
| Version: | 1.2 |
| Built: | 2026-05-15 22:13:01 UTC |
| Source: | https://github.com/cran/statMatchLCM |
A simple dataset with categorical variables.
datA
A data frame with 20 observations and 2 variables:
Category (e.g., "F", "M")
Color category (e.g., "blue")
Simulated data
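Creates a joint data set from datA and datB in which the variable not observed in a source (Y1 or Z1) is set to missing. This structure is required for the subsequent data combination.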
datAB_to_SM(datA, datB)

| datA | data.frame A |
|---|---|
| datB | data.frame B |
data.frame with harmonized structure
data(datA)
data(datB)
datAB_to_SM(datA, datB)
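The stacking idea behind this function can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper `stack_with_na()` and the toy data frames are hypothetical, and the column layout (A holds X and Y1, B holds X and Z1) is assumed from the examples elsewhere in this manual.

```r
# Hypothetical helper: stack two data frames, filling every column that
# one of them lacks with NA so that both share the same structure.
stack_with_na <- function(a, b) {
  all_cols <- union(names(a), names(b))
  pad <- function(d) {
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
    d[, all_cols, drop = FALSE]
  }
  rbind(pad(a), pad(b))
}

# Toy inputs (assumed layout): A observes X and Y1, B observes X and Z1.
toyA <- data.frame(X = c("F", "M", "F"), Y1 = c("blue", "red", "blue"),
                   stringsAsFactors = FALSE)
toyB <- data.frame(X = c("M", "F"), Z1 = c("green", "red"),
                   stringsAsFactors = FALSE)

toyAB <- stack_with_na(toyA, toyB)
toyAB  # Z1 is NA in the rows from A, Y1 is NA in the rows from B
```

The resulting NA pattern is what an imputation routine would then be asked to fill in.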
A simple dataset with categorical variables.
datB
A data frame with 15 observations and 2 variables:
Category (e.g., "F", "M")
Color category (e.g., "red", "green", "blue")
Simulated data
Converts all factor or character columns in a data.frame to numeric codes and stores the mapping tables.
fact_to_num(df)

| df | data.frame with factor or character columns |
|---|---|
A list with:
data.frame with numeric-coded variables
list of factor levels
list of mapping tables (factor → numeric)
data(datA)
fact_to_num(datA)
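A minimal base R sketch of this kind of encoding is shown below. It is only an illustration under stated assumptions, not the package source: the helper `encode_factors()` and the list element names `data` and `maps` are hypothetical, while the mapping-table columns `level_fact` and `level_num` follow the names documented for num_to_fact().

```r
# Hypothetical helper: replace factor/character columns by integer codes
# and keep one mapping table per converted column.
encode_factors <- function(df) {
  maps <- list()
  for (col in names(df)) {
    if (is.factor(df[[col]]) || is.character(df[[col]])) {
      f <- as.factor(df[[col]])
      # Mapping table: one row per level together with its integer code.
      maps[[col]] <- data.frame(
        level_fact = levels(f),
        level_num = seq_along(levels(f)),
        stringsAsFactors = FALSE
      )
      df[[col]] <- as.integer(f)
    }
  }
  list(data = df, maps = maps)
}

toy <- data.frame(X = c("F", "M", "F"), Y1 = c("blue", "red", "blue"),
                  stringsAsFactors = FALSE)
enc <- encode_factors(toy)
enc$data     # integer-coded columns
enc$maps$Y1  # mapping table for Y1
```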
Restores a factor variable from numeric codes using a mapping table created by fact_to_num().
num_to_fact(x, table)

| x | numeric vector |
|---|---|
| table | data.frame with columns level_fact and level_num |
A factor vector:
Defined by table$level_num.
Defined by table$level_fact.
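The decoding described above can be sketched with base R's `factor()`, using `level_num` as the codes and `level_fact` as the labels. The helper `decode_factor()` and the toy mapping table are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: map numeric codes back to factor labels.
decode_factor <- function(x, table) {
  factor(x, levels = table$level_num, labels = table$level_fact)
}

# Toy mapping table in the documented layout (level_fact, level_num).
map_tbl <- data.frame(
  level_fact = c("blue", "green", "red"),
  level_num = 1:3,
  stringsAsFactors = FALSE
)

restored <- decode_factor(c(3, 1, 1, 2), map_tbl)
restored
```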
Evaluates the quality of the synthetic target variable by computing the Hellinger distance between the reference and synthetic distributions. This measure quantifies the similarity between the two distributions and so provides an assessment of the accuracy and coherence of the data fusion process.
sm_quality(step1, step2)
| step1 | output from the step 1 method |
|---|---|
| step2 | output from the step 2 method |
A list with:
A numeric value representing the Hellinger distance between the reference and synthetic distributions.
A numeric vector or table representing the reference (original) distribution.
A numeric vector or table representing the synthetic (generated) distribution.
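For two discrete distributions p and q over the same categories, a common definition of the Hellinger distance is H(p, q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)), which lies between 0 and 1. The base R sketch below uses that definition; whether the package applies the same normalization is an assumption, and the helper `hellinger()` and the toy vectors are hypothetical.

```r
# Hypothetical helper: Hellinger distance between the empirical
# distributions of two categorical vectors.
hellinger <- function(x, y) {
  # Use a common set of categories so both tables line up.
  lev <- sort(union(unique(as.character(x)), unique(as.character(y))))
  p <- as.numeric(prop.table(table(factor(x, levels = lev))))
  q <- as.numeric(prop.table(table(factor(y, levels = lev))))
  sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2))
}

reference <- c("red", "red", "blue", "green")
synthetic <- c("red", "blue", "blue", "green")

hellinger(reference, synthetic)  # small positive value
hellinger(reference, reference)  # identical distributions give 0
```

Smaller values indicate that the synthetic distribution is closer to the reference one.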
Selects the imputed dataset minimizing the Hellinger distance between the reference distribution from dataset A and the synthetic distribution from dataset B, in a three-sample statistical matching framework (A, B, C).
sma_step1(datA, datB, datC, output_ll)

| datA | data.frame A |
|---|---|
| datB | data.frame B |
| datC | data.frame C |
| output_ll | list with imputed datasets (impdata) |
A list with imputed datasets:
The full imputed dataset combining A, B, and C.
Subset of the imputed data corresponding to dataset A.
Subset corresponding to dataset B.
Subset corresponding to dataset C.
Lewaa, I., Hafez, M. S., and Ismail, M. A. (2023). Mixed Statistical Matching Approaches Using a Latent Class Model: Simulation Studies. Journal of Statistics Applications and Probability, 12(1), 247–265.
D'Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory and Practice. John Wiley and Sons.
D'Orazio, M., Di Zio, M., and Scanu, M. (2019). Auxiliary variable selection in a statistical matching problem. In Zhang, L.-C., and Chambers, R. (eds.), Analysis of Integrated Data. CRC/Chapman and Hall, pp. 101–120.
Conti, P. L., Marella, D., and Scanu, M. (2016). Statistical matching analysis for complex survey data with applications. Journal of the American Statistical Association, 111(516), 1715–1725.
if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)
  datAB <- datAB_to_SM(datA, datB)
  datC <- data.frame(
    X = datB$X[1:4],
    Y1 = datA$Y1[1:4],
    Z1 = datB$Z1[1:4]
  )
  # adding auxiliary information
  datABC <- rbind(datAB, datC)
  # call DPMPM
  output_list_AII <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datABC, nrun = 500, burn = 50, thin = 50, K = 80,
    aalpha = 0.25, balpha = 0.25, m = 2, seed = 1234, silent = FALSE
  )
  step_first_aii <- sma_step1(datA, datB, datC, output_list_AII)
  names(step_first_aii)
}
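The selection rule described for this function, keeping the imputed dataset whose synthetic distribution is closest to the reference one, can be sketched in base R as follows. This is a simplified illustration, not the package's code: the helper `hellinger()`, the toy vectors, and the choice of comparing a single categorical variable are all assumptions.

```r
# Hypothetical helper: Hellinger distance between two categorical vectors.
hellinger <- function(x, y) {
  lev <- sort(union(unique(as.character(x)), unique(as.character(y))))
  p <- as.numeric(prop.table(table(factor(x, levels = lev))))
  q <- as.numeric(prop.table(table(factor(y, levels = lev))))
  sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2))
}

# Reference values (for example observed in A) and two candidate
# imputations of the same variable (for example imputed in B).
reference <- c("red", "red", "blue", "green")
candidates <- list(
  c("blue", "blue", "blue", "green"),
  c("red", "blue", "red", "green")
)

# Distance of each candidate to the reference, then keep the closest one.
dists <- vapply(candidates, function(z) hellinger(reference, z), numeric(1))
best <- which.min(dists)
selected <- candidates[[best]]
```

With several imputed datasets, the same pattern applies with one distance per dataset.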
Second step of the SMA procedure using nearest-neighbour hot deck matching on shared variables (Y1, Z1) to fuse datasets while preserving their joint distribution.
sma1_step2(step1)

| step1 | list returned by the SMA step 1 procedure |
|---|---|
A list with:
The final fused dataset A.
Nearest-neighbour distance (NND) hot deck matching results.
Lewaa, I., Hafez, M. S., and Ismail, M. A. (2023). Mixed Statistical Matching Approaches Using a Latent Class Model: Simulation Studies. Journal of Statistics Applications and Probability, 12(1), 247–265.
D'Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory and Practice. John Wiley and Sons.
D'Orazio, M., Di Zio, M., and Scanu, M. (2019). Auxiliary variable selection in a statistical matching problem. In Zhang, L.-C., and Chambers, R. (eds.), Analysis of Integrated Data. CRC/Chapman and Hall, pp. 101–120.
Conti, P. L., Marella, D., and Scanu, M. (2016). Statistical matching analysis for complex survey data with applications. Journal of the American Statistical Association, 111(516), 1715–1725.
if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)
  datAB <- datAB_to_SM(datA, datB)
  datC <- data.frame(
    X = datB$X[1:4],
    Y1 = datA$Y1[1:4],
    Z1 = datB$Z1[1:4]
  )
  # adding auxiliary information
  datABC <- rbind(datAB, datC)
  # call DPMPM (reduced settings for speed)
  output_list_AII <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datABC, nrun = 50, burn = 10, thin = 10, K = 20,
    aalpha = 0.25, balpha = 0.25, m = 2, seed = 1234, silent = TRUE
  )
  step_first_aii <- sma_step1(datA, datB, datC, output_list_AII)
  step_second_aii1 <- sma1_step2(step_first_aii)
  result_aii1 <- step_second_aii1$datA_fused_2
  head(result_aii1)
}
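The general idea of nearest-neighbour hot deck matching can be sketched in base R as below. This is not the routine used by the package: the helper `nn_hotdeck()`, the toy data, the choice of matching variables and target, and the use of Manhattan distance on numeric codes are all assumptions made for illustration.

```r
# Hypothetical helper: for each recipient row, find the donor row with the
# smallest Manhattan distance on the matching variables and copy its target.
nn_hotdeck <- function(rec, don, match_vars, target) {
  rec_m <- as.matrix(rec[, match_vars, drop = FALSE])
  don_m <- as.matrix(don[, match_vars, drop = FALSE])
  donor_idx <- vapply(seq_len(nrow(rec_m)), function(i) {
    # Subtract the recipient row from every donor row, column by column.
    d <- rowSums(abs(sweep(don_m, 2, rec_m[i, ])))
    which.min(d)  # ties are resolved by taking the first donor
  }, integer(1))
  rec[[target]] <- don[[target]][donor_idx]
  list(fused = rec, donor_idx = donor_idx)
}

# Toy numeric-coded data: recipients lack Z1, donors carry it.
rec <- data.frame(X = c(1, 2, 2), Y1 = c(1, 3, 2))
don <- data.frame(X = c(1, 2, 2), Y1 = c(1, 2, 3),
                  Z1 = c("a", "b", "c"), stringsAsFactors = FALSE)

res <- nn_hotdeck(rec, don, match_vars = c("X", "Y1"), target = "Z1")
res$fused
```

Treating category codes as numbers is a simplification; a distance designed for categorical variables may be more appropriate in practice.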
Second step of SMA using multinomial models and nearest-neighbour hot deck on fitted probabilities to improve data fusion accuracy.
sma2_step2(step1)

| step1 | list returned by the SMA step 1 procedure |
|---|---|
A list with:
A data frame containing the final fused (imputed) version of dataset A.
A list with results from the nearest-neighbour distance matching procedure.
A formula object specifying the model used.
A data frame of fitted values obtained for dataset A.
A data frame of fitted values obtained for dataset C.
Lewaa, I., Hafez, M. S., and Ismail, M. A. (2023). Mixed Statistical Matching Approaches Using a Latent Class Model: Simulation Studies. Journal of Statistics Applications and Probability, 12(1), 247–265.
D'Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory and Practice. John Wiley and Sons.
D'Orazio, M., Di Zio, M., and Scanu, M. (2019). Auxiliary variable selection in a statistical matching problem. In Zhang, L.-C., and Chambers, R. (eds.), Analysis of Integrated Data. CRC/Chapman and Hall, pp. 101–120.
Conti, P. L., Marella, D., and Scanu, M. (2016). Statistical matching analysis for complex survey data with applications. Journal of the American Statistical Association, 111(516), 1715–1725.
if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)
  datAB <- datAB_to_SM(datA, datB)
  datC <- data.frame(
    X = datB$X[1:4],
    Y1 = datA$Y1[1:4],
    Z1 = datB$Z1[1:4]
  )
  # adding auxiliary information
  datABC <- rbind(datAB, datC)
  output_list_AII <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datABC, nrun = 50, burn = 10, thin = 10, K = 20,
    aalpha = 0.25, balpha = 0.25, m = 2, seed = 1234, silent = TRUE
  )
  step_first_aii <- sma_step1(datA, datB, datC, output_list_AII)
  step_second_aii2 <- sma2_step2(step_first_aii)
  result_aii2 <- step_second_aii2$datA_fused_2
  head(result_aii2)
}
Second step of SMA: generates the target variable using multinomial probability distributions derived from a donor-based model, to achieve statistically coherent data fusion.
sma3_step2(step1)

| step1 | list returned by the SMA step 1 procedure |
|---|---|
A list with:
A data frame containing the final fused version of dataset A.
A numeric vector of estimated model coefficients.
A matrix or data frame of dummy variables constructed for dataset A.
A matrix or data frame of dummy variables constructed for dataset B.
A numeric matrix of predicted probabilities for each category of variable Z (rows correspond to observations).
A binary matrix (one-hot encoded) sampled from prob_Z_all, where each row contains a single 1 indicating the selected category of Z.
Lewaa, I., Hafez, M. S., and Ismail, M. A. (2023). Mixed Statistical Matching Approaches Using a Latent Class Model: Simulation Studies. Journal of Statistics Applications and Probability, 12(1), 247–265.
D'Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory and Practice. John Wiley and Sons.
D'Orazio, M., Di Zio, M., and Scanu, M. (2019). Auxiliary variable selection in a statistical matching problem. In Zhang, L.-C., and Chambers, R. (eds.), Analysis of Integrated Data. CRC/Chapman and Hall, pp. 101–120.
Conti, P. L., Marella, D., and Scanu, M. (2016). Statistical matching analysis for complex survey data with applications. Journal of the American Statistical Association, 111(516), 1715–1725.
if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)
  datAB <- datAB_to_SM(datA, datB)
  datC <- data.frame(
    X = datB$X[1:4],
    Y1 = datA$Y1[1:4],
    Z1 = datB$Z1[1:4]
  )
  # adding auxiliary information
  datABC <- rbind(datAB, datC)
  # call DPMPM (reduced settings for speed)
  output_list_AII <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datABC, nrun = 50, burn = 10, thin = 10, K = 20,
    aalpha = 0.25, balpha = 0.25, m = 2, seed = 1234, silent = TRUE
  )
  step_first_aii <- sma_step1(datA, datB, datC, output_list_AII)
  # use the step 2 variant documented in this entry
  step_second_aii3 <- sma3_step2(step_first_aii)
  result_aii3 <- step_second_aii3$datA_fused_2
  head(result_aii3)
}
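The sampling step that turns a matrix of predicted category probabilities into a one-hot matrix, as described for the returned values of this function, can be sketched in base R with `rmultinom()`. The probability matrix and category labels below are made up for illustration, and the sketch does not claim to reproduce the package's internal code.

```r
set.seed(1)

# Hypothetical predicted probabilities: 4 observations x 3 categories of Z.
prob_Z <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.8, 0.1,
    0.3, 0.3, 0.4,
    0.0, 0.0, 1.0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("red", "green", "blue"))
)

# One multinomial draw of size 1 per row gives a one-hot indicator row.
Z_onehot <- t(apply(prob_Z, 1, function(p) rmultinom(1, size = 1, prob = p)))
colnames(Z_onehot) <- colnames(prob_Z)

# Translate each one-hot row back to its category label.
Z_drawn <- colnames(prob_Z)[max.col(Z_onehot)]
Z_drawn
```

Because the categories are sampled rather than set to the most probable value, repeated runs with different seeds reflect the uncertainty in the fitted probabilities.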
Identifies and selects the imputed dataset that minimizes the Hellinger distance between the reference and synthetic distributions, thereby achieving the highest level of distributional similarity and improving the statistical consistency of the imputation.
smc_step1(datA, datB, output_ll)

| datA | data.frame A |
|---|---|
| datB | data.frame B |
| output_ll | list with imputed datasets (impdata) |
A list with imputed datasets:
The full imputed dataset obtained after combining A and B.
Subset of datAB_imp1 corresponding to dataset A.
Subset of datAB_imp1 corresponding to dataset B.
Lewaa, I., Hafez, M. S., and Ismail, M. A. (2023). Mixed Statistical Matching Approaches Using a Latent Class Model: Simulation Studies. Journal of Statistics Applications and Probability, 12(1), 247–265.
D'Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory and Practice. John Wiley and Sons.
D'Orazio, M., Di Zio, M., and Scanu, M. (2019). Auxiliary variable selection in a statistical matching problem. In Zhang, L.-C., and Chambers, R. (eds.), Analysis of Integrated Data. CRC/Chapman and Hall, pp. 101–120.
Conti, P. L., Marella, D., and Scanu, M. (2016). Statistical matching analysis for complex survey data with applications. Journal of the American Statistical Association, 111(516), 1715–1725.
if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)
  datAB <- datAB_to_SM(datA, datB)
  # call DPMPM (reduced settings for speed)
  output_list <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datAB, nrun = 50, burn = 10, thin = 10, K = 20,
    aalpha = 0.25, balpha = 0.25, m = 2, seed = 1234, silent = TRUE
  )
  step_first <- smc_step1(datA, datB, output_list)
  str(step_first$datA_imp1)
  str(step_first$datB_imp1)
}
Performs nearest-neighbour hot deck imputation on the selected imputed datasets by matching recipient units in dataset A with donor units in dataset B based on common variables. The procedure transfers the target variable from the nearest donor to construct a statistically coherent fused dataset.
smc1_step2(step1)

| step1 | output from smc_step1() |
|---|---|
A list with:
The final fused dataset A after step 2 of the procedure.
A list containing nearest-neighbour distance matching results.
Lewaa, I., Hafez, M. S., and Ismail, M. A. (2023). Mixed Statistical Matching Approaches Using a Latent Class Model: Simulation Studies. Journal of Statistics Applications and Probability, 12(1), 247–265.
D'Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory and Practice. John Wiley and Sons.
D'Orazio, M., Di Zio, M., and Scanu, M. (2019). Auxiliary variable selection in a statistical matching problem. In Zhang, L.-C., and Chambers, R. (eds.), Analysis of Integrated Data. CRC/Chapman and Hall, pp. 101–120.
Conti, P. L., Marella, D., and Scanu, M. (2016). Statistical matching analysis for complex survey data with applications. Journal of the American Statistical Association, 111(516), 1715–1725.
if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)
  datAB <- datAB_to_SM(datA, datB)
  # call DPMPM (reduced settings for speed)
  output_list <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datAB, nrun = 50, burn = 10, thin = 10, K = 20,
    aalpha = 0.25, balpha = 0.25, m = 2, seed = 1234, silent = TRUE
  )
  step_first <- smc_step1(datA, datB, output_list)
  step_second <- smc1_step2(step_first)
  result <- step_second$datA_fused_2
  result
}
Implements a model-based hot deck data fusion approach by estimating multinomial models on both datasets and matching observations based on their fitted probability distributions.
smc2_step2(step1)

| step1 | output from smc_step1() |
|---|---|
A list with:
A data frame containing the final fused (imputed) version of dataset A.
A list with results from the nearest-neighbour distance matching procedure.
A formula object used to fit the models.
A data frame of fitted values obtained from the model for dataset A.
A data frame of fitted values obtained from the model for dataset B.
Lewaa, I., Hafez, M. S., and Ismail, M. A. (2023). Mixed Statistical Matching Approaches Using a Latent Class Model: Simulation Studies. Journal of Statistics Applications and Probability, 12(1), 247–265.
D'Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory and Practice. John Wiley and Sons.
D'Orazio, M., Di Zio, M., and Scanu, M. (2019). Auxiliary variable selection in a statistical matching problem. In Zhang, L.-C., and Chambers, R. (eds.), Analysis of Integrated Data. CRC/Chapman and Hall, pp. 101–120.
Conti, P. L., Marella, D., and Scanu, M. (2016). Statistical matching analysis for complex survey data with applications. Journal of the American Statistical Association, 111(516), 1715–1725.
if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)
  datAB <- datAB_to_SM(datA, datB)
  # call DPMPM (reduced settings for speed)
  output_list <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datAB, nrun = 50, burn = 10, thin = 10, K = 20,
    aalpha = 0.25, balpha = 0.25, m = 2, seed = 1234, silent = TRUE
  )
  step_first <- smc_step1(datA, datB, output_list)
  step_second <- smc2_step2(step_first)
  result <- step_second$datA_fused_2
  result
}
Generates the target variable using multinomial simulation based on fitted probability distributions, to preserve uncertainty and variability.
smc3_step2(step1)

| step1 | output from smc_step1() |
|---|---|
A list with:
A data frame containing the final fused (imputed) version of dataset A.
A data frame of fitted values obtained from the model for dataset A.
Lewaa, I., Hafez, M. S., and Ismail, M. A. (2023). Mixed Statistical Matching Approaches Using a Latent Class Model: Simulation Studies. Journal of Statistics Applications and Probability, 12(1), 247–265.
D'Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory and Practice. John Wiley and Sons.
D'Orazio, M., Di Zio, M., and Scanu, M. (2019). Auxiliary variable selection in a statistical matching problem. In Zhang, L.-C., and Chambers, R. (eds.), Analysis of Integrated Data. CRC/Chapman and Hall, pp. 101–120.
Conti, P. L., Marella, D., and Scanu, M. (2016). Statistical matching analysis for complex survey data with applications. Journal of the American Statistical Association, 111(516), 1715–1725.
if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)
  datAB <- datAB_to_SM(datA, datB)
  # call DPMPM (reduced settings for speed)
  output_list <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datAB, nrun = 50, burn = 10, thin = 10, K = 20,
    aalpha = 0.25, balpha = 0.25, m = 2, seed = 1234, silent = TRUE
  )
  step_first <- smc_step1(datA, datB, output_list)
  step_second <- smc3_step2(step_first)
  result <- step_second$datA_fused_2
  result
}