Title: | Regression and Clustering in Multivariate Response Scenarios |
---|---|
Description: | Fitting multivariate response models with random effects on one or two levels; whereby the (one-dimensional) random effect represents a latent variable approximating the multivariate space of outcomes, after possible adjustment for covariates. The method is particularly useful for multivariate, highly correlated outcome variables with unobserved heterogeneities. Applications include regression with multivariate responses, as well as multivariate clustering or ranking problems. See Zhang and Einbeck (2024) <doi:10.1007/s42519-023-00357-0>. |
Authors: | Yingjuan Zhang [aut, cre], Jochen Einbeck [aut, ctb] |
Maintainer: | Yingjuan Zhang <[email protected]> |
License: | GPL-3 |
Version: | 0.2.0 |
Built: | 2024-10-25 05:30:52 UTC |
Source: | CRAN |
The data were recorded via 4D ultrasound scans from 40 fetuses (20 before Covid and 20 during Covid) at 32 weeks gestation, and consist of the number of movements each fetus carries out in relation to the recordable scan length.
data(fetal_covid_data)
An object of class "data.frame"
Inner Brow Raiser, Outer Brow Raiser, Brow Lower, Cheek Raiser, Nose Wrinkle.
Turn Right, Turn Left, Up, Down.
Upper Lip Raiser, Nasolabial Furrow, Lip Puller, Lower Lip Depressor, Lip Pucker, Tongue Show, Lip Stretch, Lip Presser, Lip Suck, Lips Parting, Jaw Drop, Mouth Stretch.
Upper Face, Side Face, Lower Face, Mouth Area.
All scans were coded for eye blink.
"during the pandemic" is coded by 1, "before the pandemic" is coded by 0.
specifies whether it is during or before the pandemic.
Reissland, N., Ustun, B. and Einbeck, J. (2024). The effects of lockdown during the COVID-19 pandemic on fetal movement profiles. BMC Pregnancy and Childbirth, 24(1), 1-7.
    data(fetal_covid_data)
    head(fetal_covid_data)
The data are taken from the International Adult Literacy Survey (IALS), collected in 13 countries on Prose, Document, and Quantitative scales between 1994 and 1995. They are reported as the percentage of individuals who could not reach a basic level of literacy in each country.
data(IALS_data)
An object of class "data.frame"
On the prose scale, the percentage of individuals who could not reach a basic level of literacy in each country.
On the document scale, the percentage of individuals who could not reach a basic level of literacy in each country.
On the quantitative scale, the percentage of individuals who could not reach a basic level of literacy in each country.
Specifies the country.
Specifies the gender.
Sofroniou, N., Hoad, D., & Einbeck, J. (2008). League tables for literacy survey data based on random effect models. In: Proceedings of the 23rd International Workshop on Statistical Modelling, Utrecht; pp. 402-405.
    data(IALS_data)
    head(IALS_data)
This function is used to obtain the Maximum Likelihood Estimates (MLE) using the EM algorithm for one-level multivariate data. The estimates enable users to conduct clustering, ranking, and simultaneous dimension reduction on the multivariate dataset. Furthermore, when covariates are included, the function supports the fitting of multivariate response models, expanding its utility for regression analysis. The details of the model used in this function can be found in Zhang and Einbeck (2024). Note that this function is designed for multivariate data. When the dimension of the data is 1, please use alldist as an alternative. A warning message will also be displayed when the input data is a univariate dataset.
data |
A data set object; we denote the dimension to be |
v |
Covariate(s). |
K |
Number of mixture components, the default is |
steps |
Number of iterations, the default is |
start |
Containing parameters involved in the proposed model ( |
option |
Four options for selecting the starting values for the parameters in the model. The default is option = 1. More details can be found in start_em. |
var_fun |
There are four types of variance specifications;
|
The estimated parameters in the model obtained through the EM algorithm at the convergence.
p |
The estimates for the parameter |
alpha |
The estimates for the parameter |
z |
The estimates for the parameter |
beta |
The estimates for the parameter |
gamma |
The estimates for the parameter |
sigma |
The estimates for the parameter |
W |
The posterior probability matrix. |
loglikelihood |
The approximated log-likelihood of the fitted model. |
disparity |
The disparity ( |
number_parameters |
The number of parameters estimated in the EM algorithm. |
AIC |
The AIC value ( |
BIC |
The BIC value ( |
starting_values |
A list of starting values for parameters used in the EM algorithm. |
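The fit statistics listed above are related to each other. The following sketch assumes the usual definitions: disparity as minus twice the log-likelihood, and AIC and BIC as penalised disparities. All numbers are invented for illustration, and the way the package counts parameters or observations may differ.

```r
# Hypothetical values, not taken from any fitted model
loglik <- -1130.5   # approximated log-likelihood
n_par  <- 9         # number of estimated parameters
n      <- 272       # number of observations

# Usual definitions (an assumption about the package's formulas)
disparity <- -2 * loglik
aic <- disparity + 2 * n_par
bic <- disparity + n_par * log(n)

c(disparity = disparity, AIC = aic, BIC = bic)
```

Smaller values indicate a better trade-off between fit and complexity. BIC penalises additional parameters more strongly than AIC once the sample size exceeds about 8.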
It is worth noting that due to the sequential nature of the updates within the M-step, this algorithm can be considered an ECM algorithm.
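To illustrate what the posterior probability matrix W represents, here is a minimal E-step sketch in base R. It assumes component means of the form alpha + beta * z[k], as used for the cluster centres in the example of this help page, and a diagonal error variance common to all components. The parameter values are hypothetical and chosen by eye. This is not the package's internal code.

```r
data(faithful)
x <- as.matrix(faithful)          # n x m data matrix

# Hypothetical parameter values, for illustration only
p     <- c(0.35, 0.65)            # mixture masses
z     <- c(-1.3, 0.7)             # mass points of the latent variable
alpha <- colMeans(x)              # intercept vector
beta  <- c(1.1, 13)               # loadings of the latent variable
sigma <- c(0.4, 6)                # assumed diagonal standard deviations

# Row-wise log-density of independent normals with means mu and sds sd
log_dens <- function(x, mu, sd) {
  std <- sweep(sweep(x, 2, mu, "-"), 2, sd, "/")
  -0.5 * rowSums(std^2) - sum(log(sd)) - 0.5 * ncol(x) * log(2 * pi)
}

# Unnormalised log posterior weights: log p_k + log f(x_i | z_k)
log_w <- sapply(seq_along(p), function(k) {
  log(p[k]) + log_dens(x, alpha + beta * z[k], sigma)
})

# Normalise each row on the log scale for numerical stability
log_w <- log_w - apply(log_w, 1, max)
W <- exp(log_w) / rowSums(exp(log_w))

# MAP rule: assign each observation to its most probable component
index <- apply(W, 1, which.max)
table(index)
```

Each row of W sums to one. In an EM iteration, these weights would then enter the M-step updates of the parameters.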
Zhang, Y. and Einbeck, J. (2024). A Versatile Model for Clustered and Highly Correlated Multivariate Data. J Stat Theory Pract 18(5). doi:10.1007/s42519-023-00357-0
    ##Example for data without covariates.
    data(faithful)
    res <- mult.em_1level(faithful, K = 2, steps = 10, var_fun = 1)

    ##Graph showing the estimated one-dimensional space with cluster centers in red and alpha in green.
    x <- res$alpha[1] + res$beta[1] * res$z
    y <- res$alpha[2] + res$beta[2] * res$z
    plot(faithful, col = 8)
    points(x = x[1], y = y[1], type = "p", col = "red", pch = 17)
    points(x = x[2], y = y[2], type = "p", col = "red", pch = 17)
    points(x = res$alpha[1], y = res$alpha[2], type = "p", col = "darkgreen", pch = 4)
    slope <- (y[2] - y[1]) / (x[2] - x[1])
    intercept <- y[1] - slope * x[1]
    abline(intercept, slope, col = "red")

    ##Graph showing the original data points being assigned to different
    ##clusters according to the maximum a posteriori (MAP) rule.
    index <- apply(res$W, 1, which.max)
    faithful_grouped <- cbind(faithful, index)
    colors <- c("#FDAE61", "#66BD63")
    plot(faithful_grouped[, -3], pch = 1, col = colors[factor(index)])

    ##Example for data with covariates.
    data(fetal_covid_data)
    set.seed(2)
    covid_res <- mult.em_1level(fetal_covid_data[, c(1:5)], v = fetal_covid_data$status_bi,
                                K = 3, steps = 20, var_fun = 2)
    coeffs <- covid_res$gamma

    ##Compare with regression coefficients from fitting individual linear models.
    summary(lm(UpperFaceMovements ~ status_bi, data = fetal_covid_data))$coefficients[2, 1]
    summary(lm(Headmovements ~ status_bi, data = fetal_covid_data))$coefficients[2, 1]
This function extends the one-level version mult.em_1level, and it is designed to obtain Maximum Likelihood Estimates (MLE) using the EM algorithm for nested (structured) multivariate data, e.g. multivariate test scores (such as on numeracy, literacy) of students nested in different classes or schools. The resulting estimates can be applied for clustering or constructing league tables (ranking of observations). With the inclusion of covariates, the model allows fitting a multivariate response model for further regression analysis. Detailed information about the model used in this function can be found in Zhang et al. (2023). Note that this function is designed for multivariate data. When the dimension of the data is 1, please use allvc as an alternative. A warning message will also be displayed when the input data is a univariate dataset.
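As an illustration of how a league table could be built from the estimates, the sketch below ranks units by the posterior mean of the latent variable, computed from the posterior probability matrix W and the mass points z. This is one common choice and is assumed here for illustration; it is not necessarily the ranking rule used by the authors. The unit names and all values are hypothetical.

```r
# Hypothetical mass points and posterior probabilities (rows sum to 1)
z <- c(-1.2, 0.1, 1.4)
W <- matrix(c(0.7, 0.3, 0.0,
              0.1, 0.8, 0.1,
              0.0, 0.2, 0.8,
              0.3, 0.6, 0.1),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("unit_a", "unit_b", "unit_c", "unit_d"), NULL))

# Posterior mean of the latent variable for each unit
z_post <- drop(W %*% z)

# League table, ordered from the largest to the smallest posterior mean
league <- data.frame(unit = names(z_post), z_post = z_post,
                     rank = rank(-z_post), row.names = NULL)
league[order(league$rank), ]
```

Whether a large or a small latent value corresponds to a "good" position depends on the signs of the estimated beta and on the meaning of the response variables.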
data |
A data set object; we denote the dimension to be |
v |
Covariate(s). |
K |
Number of mixture components, the default is |
steps |
Number of iterations, the default is |
start |
Containing parameters involved in the proposed model ( |
option |
Four options for selecting the starting values for the parameters in the model. The default is |
var_fun |
There are two types of variance specifications; |
The estimated parameters in the model obtained through the EM algorithm,
where the upper-level unit is indexed by
, and the lower-level unit is indexed by
.
p |
The estimates for the parameter |
alpha |
The estimates for the parameter |
z |
The estimates for the parameter |
beta |
The estimates for the parameter |
gamma |
The estimates for the parameter |
sigma |
The estimates for the parameter |
W |
The posterior probability matrix. |
loglikelihood |
The approximated log-likelihood of the fitted model. |
disparity |
The disparity ( |
number_parameters |
The number of parameters estimated in the EM algorithm. |
AIC |
The AIC value ( |
starting_values |
A list of starting values for parameters used in the EM algorithm. |
It is worth noting that due to the sequential nature of the updates within the M-step, this algorithm can be considered an ECM algorithm.
Zhang, Y., Einbeck, J. and Drikvandi, R. (2023). A multilevel multivariate response model for data with latent structures. In: Proceedings of the 37th International Workshop on Statistical Modelling, pages 343-348. Link on RG: https://www.researchgate.net/publication/375641972_A_multilevel_multivariate_response_model_for_data_with_latent_structures
    ##Examples for data without covariates.
    data(trading_data)
    set.seed(49)
    trade_res <- mult.em_2level(trading_data, K = 4, steps = 10, var_fun = 2)
    i_1 <- apply(trade_res$W, 1, which.max)
    ##Expand the upper-level MAP assignments to the lower-level observations
    ##(one count per upper-level unit).
    ind_certain <- rep(as.vector(i_1), c(4,5,5,3,5,5,4,4,5,5,5,5,5,5,5,5,5,5,
                                         3,5,5,5,5,4,4,5,5,5,4,5,4,5,5,5,3,5,5,5,5,5,5,4,5,4))
    colors <- c("#FF6600", "#66BD63", "lightpink", "purple")
    plot(trading_data[, -3], pch = 1, col = colors[factor(ind_certain)])
    ##The legend uses the same colour order as the plot.
    legend("topleft", legend = c("Mass point 1", "Mass point 2", "Mass point 3", "Mass point 4"),
           col = colors, pch = 1, cex = 0.8)

    ###The Twins data
    library(lme4)
    set.seed(26)
    twins_res <- mult.em_2level(twins_data[, c(1, 2, 3)], v = twins_data[, c(4, 5, 6)],
                                K = 2, steps = 20, var_fun = 2)
    coeffs <- twins_res$gamma

    ##Compare to the estimated coefficients obtained using individual two-level models (lmer()).
    summary(lmer(SelfTouchCodable ~ Depression + PSS + Anxiety + (1 | id),
                 data = twins_data, REML = TRUE))$coefficients[2, 1]
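The trading example expands the upper-level MAP assignments to the lower-level observations with hard-coded group sizes. The sizes can also be derived from a grouping variable. The sketch below uses toy data. It assumes that W has one row per upper-level unit, in order of first appearance, and that the rows of each unit are stored contiguously. The column name unit is hypothetical.

```r
# Toy two-level data: three upper-level units with 2, 3 and 1 rows
toy <- data.frame(unit = c("A", "A", "B", "B", "B", "C"),
                  y1 = c(1.0, 1.2, 3.1, 2.9, 3.3, 1.1),
                  y2 = c(2.1, 2.0, 5.9, 6.2, 6.0, 2.2))

# Hypothetical posterior matrix: one row per upper-level unit
W <- matrix(c(0.9, 0.1,
              0.2, 0.8,
              0.6, 0.4),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), NULL))

# MAP assignment on the upper level
map_upper <- apply(W, 1, which.max)

# Number of lower-level rows per unit, in order of first appearance
counts <- as.vector(table(factor(toy$unit, levels = unique(toy$unit))))

# Repeat each unit's label once per lower-level row
map_lower <- rep(map_upper, times = counts)
cbind(toy, cluster = unname(map_lower))
```

Deriving the counts from the data avoids a mismatch when rows are added or removed.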
This package implements methodology for the estimation of multivariate response models with random effects on one or two levels;
whereby the (one-dimensional) random effect represents a latent variable approximating the multivariate space of outcomes,
after possible adjustment for covariates. The estimation methodology makes use of a nonparametric maximum likelihood-type approach,
where the random effect distribution is approximated by a discrete mixture, hence allowing the use of the EM algorithm for the estimation of all model parameters.
The method is particularly useful for multivariate,
highly correlated outcome variables with unobserved heterogeneities. Applications include regression with multivariate responses,
as well as multivariate clustering or ranking problems.
The details of the models can be found in Zhang and Einbeck (2024) and Zhang et al. (2023).
The main functions are mult.em_1level and mult.em_2level for fitting the raw models, as well as the wrapper functions mult.reg_1level and mult.reg_2level, which run the algorithm repeatedly with a view to finding optimal starting points, with the help of the function start_em.
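To make the idea of a one-dimensional latent variable spanning a multivariate outcome space concrete, the following base R sketch simulates data in which every response is a linear function of a single discrete random effect plus independent noise. The model form and all parameter values are assumptions made for illustration only.

```r
set.seed(1)
n     <- 300
p     <- c(0.3, 0.5, 0.2)     # mixture masses (hypothetical)
z_k   <- c(-1.5, 0, 1.5)      # mass points of the latent variable
alpha <- c(10, 50, 3)         # intercepts of the three responses
beta  <- c(2, 8, 0.5)         # loadings of the latent variable
sigma <- c(0.5, 2, 0.2)       # error standard deviations

# Draw the latent component of each observation
comp <- sample(seq_along(p), n, replace = TRUE, prob = p)

# Responses: alpha + beta * z + independent normal errors
signal <- sweep(outer(z_k[comp], beta), 2, alpha, "+")
noise  <- matrix(rnorm(n * 3, sd = rep(sigma, each = n)), n, 3)
x <- signal + noise

# All responses are driven by the same latent variable
round(cor(x), 2)
```

Data of this kind show the high correlation between outcome variables and the unobserved heterogeneity that the package is designed for.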
Package: mult.latent.reg
Type: Package
License: GPL-3
Yingjuan Zhang <[email protected]>
Jochen Einbeck
Zhang, Y., Einbeck, J., and Drikvandi, R. (2023). A multilevel multivariate response model for data with latent structures. In: Proceedings of the 37th International Workshop on Statistical Modelling, Dortmund; pages 343-348. Link on RG: https://www.researchgate.net/publication/375641972_A_multilevel_multivariate_response_model_for_data_with_latent_structures.
Zhang, Y. and Einbeck, J. (2024). A Versatile Model for Clustered and Highly Correlated Multivariate Data. J Stat Theory Pract 18(5). doi:10.1007/s42519-023-00357-0
This wrapper function runs mult.em_1level multiple times to fit the multivariate response model with a one-level random effect of Zhang and Einbeck (2024), and selects the best result, i.e. the one with the smallest AIC value.
data |
A data set object; we denote the dimension of a data set to be |
v |
Covariate(s). |
K |
Number of mixture components, the default is |
steps |
Number of iterations within each |
num_runs |
Number of function iteration runs, the default is |
start |
Containing parameters involved in the proposed model ( |
option |
Four options for selecting the starting values for the parameters in the model. The default is |
var_fun |
There are four types of variance specifications;
|
The best estimated result (with the smallest AIC value) in the model (Zhang and Einbeck, 2024) obtained through the EM algorithm.
p |
The estimates for the parameter |
alpha |
The estimates for the parameter |
z |
The estimates for the parameter |
beta |
The estimates for the parameter |
gamma |
The estimates for the parameter |
sigma |
The estimates for the parameter |
W |
The posterior probability matrix. |
loglikelihood |
The approximated log-likelihood of the fitted model. |
disparity |
The disparity ( |
number_parameters |
The number of parameters estimated in the EM algorithm. |
AIC |
The AIC value ( |
BIC |
The BIC value ( |
aic_data |
All AIC values in each run. |
Starting_values |
Lists of starting values for parameters used in each |
Zhang, Y. and Einbeck, J. (2024). A Versatile Model for Clustered and Highly Correlated Multivariate Data. J Stat Theory Pract 18(5). doi:10.1007/s42519-023-00357-0
    ##Run mult.em_1level() multiple times and select the best result (smallest AIC value).
    set.seed(7)
    results <- mult.reg_1level(fetal_covid_data[, c(1:5)], v = fetal_covid_data$status_bi,
                               K = 3, num_runs = 5, steps = 20, var_fun = 2, option = 1)

    ##Reproduce the best result: the best result is the 5th run in the above example.
    rep_best_result <- mult.em_1level(fetal_covid_data[, c(1:5)], v = fetal_covid_data$status_bi,
                                      K = 3, steps = 20, var_fun = 2, option = 1,
                                      start = results$Starting_values[[5]])
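Rather than reading off the index of the best run by hand, it can be located from the returned AIC values. The sketch below uses a mock object that mimics the documented elements aic_data and Starting_values. It assumes that aic_data holds one numeric value per run; the values are invented.

```r
# Mock result with the documented element names (values are invented)
results <- list(
  aic_data = c(2291.4, 2287.9, 2302.1, 2288.6, 2279.3),
  Starting_values = lapply(1:5, function(r) list(run = r))
)

# Index of the run with the smallest AIC
best_run <- which.min(as.numeric(unlist(results$aic_data)))

# Starting values of that run, ready to pass on via the start argument
best_start <- results$Starting_values[[best_run]]
best_run
```

With a real result object, best_start would take the place of the hard-coded list element in the call that reproduces the best fit.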
This wrapper function runs mult.em_2level multiple times to fit the multivariate response model with a two-level random effect of Zhang et al. (2023), and selects the best result, i.e. the one with the smallest AIC value.
data |
A data set object; we denote the dimension of a data set to be |
v |
Covariate(s). |
K |
Number of mixture components, the default is |
steps |
Number of iterations within each |
num_runs |
Number of function iteration runs, the default is |
start |
Containing parameters involved in the proposed model ( |
option |
Four options for selecting the starting values for the parameters in the model. The default is |
var_fun |
There are two types of variance specifications; |
The best estimated result (with the smallest AIC value) in the model obtained through the EM algorithm (Zhang et al., 2023),
where the upper-level unit is indexed by
, and the lower-level unit is indexed by
.
p |
The estimates for the parameter |
alpha |
The estimates for the parameter |
z |
The estimates for the parameter |
beta |
The estimates for the parameter |
gamma |
The estimates for the parameter |
sigma |
The estimates for the parameter |
W |
The posterior probability matrix. |
loglikelihood |
The approximated log-likelihood of the fitted model. |
disparity |
The disparity ( |
number_parameters |
The number of parameters estimated in the EM algorithm. |
AIC |
The AIC value ( |
aic_data |
All AIC values in each run. |
Starting_values |
Lists of starting values for parameters used in each |
Zhang, Y., Einbeck, J. and Drikvandi, R. (2023). A multilevel multivariate response model for data with latent structures. In: Proceedings of the 37th International Workshop on Statistical Modelling, pages 343-348. Link on RG: https://www.researchgate.net/publication/375641972_A_multilevel_multivariate_response_model_for_data_with_latent_structures
    ##Run mult.em_2level() multiple times and select the best result (smallest AIC value).
    set.seed(7)
    results <- mult.reg_2level(trading_data, K = 4, steps = 10, num_runs = 5,
                               var_fun = 2, option = 1)

    ##Reproduce the best result: the best result is the 2nd run in the above example.
    rep_best_result <- mult.em_2level(trading_data, K = 4, steps = 10, var_fun = 2,
                                      option = 1, start = results$Starting_values[[2]])
Generates the starting values for the parameters of the EM algorithm used in the functions mult.em_1level, mult.em_2level, mult.reg_1level and mult.reg_2level.
data |
A data set object; we denote the dimension of a data set to be |
v |
Covariate(s); we denote the dimension of it to be |
K |
Number of mixture components, the default is |
steps |
Number of iterations. This will only be used when using |
option |
Four options for selecting the starting values for the parameters. The default is |
var_fun |
The four variance specifications. When |
p |
optional; specifies starting values for |
z |
optional; specifies starting values for |
beta |
optional; specifies starting values for |
alpha |
optional; specifies starting values for |
sigma |
optional; specifies starting values for |
gamma |
optional; the coefficients for the covariates; specifies starting values for |
The starting values (in a list) for the parameters of the models of Zhang and Einbeck (2024) and
Zhang et al. (2023), as used in the four functions mult.em_1level, mult.em_2level, mult.reg_1level and mult.reg_2level.
p |
The starting value for the parameter |
alpha |
The starting value for the parameter |
z |
The starting value for the parameter |
beta |
The starting value for the parameter |
gamma |
The starting value for the parameter |
sigma |
The starting value for the parameter |
Zhang, Y., Einbeck, J. and Drikvandi, R. (2023). A multilevel multivariate response model for data with latent structures. In: Proceedings of the 37th International Workshop on Statistical Modelling, pages 343-348. Link on RG: https://www.researchgate.net/publication/375641972_A_multilevel_multivariate_response_model_for_data_with_latent_structures.
Zhang, Y. and Einbeck, J. (2024). A Versatile Model for Clustered and Highly Correlated Multivariate Data. J Stat Theory Pract 18(5). doi:10.1007/s42519-023-00357-0
    ##Example for the faithful data.
    data(faithful)
    start <- start_em(faithful, option = 1)
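For readers who want to construct starting values by hand, the following base R sketch shows one plausible strategy based on the first principal component. It is a hypothetical illustration, not one of the four options implemented in start_em, and the resulting list is not guaranteed to match the format expected by the start argument.

```r
data(faithful)
x <- as.matrix(faithful)
K <- 2

pc <- prcomp(x)                    # PCA on the centred data
s  <- pc$x[, 1] / pc$sdev[1]       # standardised first-component scores

# Split the scores into K groups of roughly equal size
breaks <- quantile(s, probs = seq(0, 1, length.out = K + 1))
groups <- cut(s, breaks = breaks, include.lowest = TRUE, labels = FALSE)

z     <- as.numeric(tapply(s, groups, mean))         # mass points
p     <- as.numeric(table(groups)) / length(groups)  # masses
alpha <- colMeans(x)                                 # intercepts
beta  <- pc$rotation[, 1] * pc$sdev[1]               # loadings

# Residual standard deviations around the component means
fitted <- sweep(outer(z[groups], beta), 2, alpha, "+")
sigma  <- apply(x - fitted, 2, sd)

start_sketch <- list(p = p, alpha = alpha, z = z, beta = beta, sigma = sigma)
str(start_sketch)
```

Because the scores are centred, the mass points have a weighted mean of zero, which places alpha at the overall mean of the data.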
The variables are given as the percentage of imports and exports in relation to the overall GDP. For our analysis, the data set comprises 44 countries, and we specifically selected the time period between 2018 and 2022.
data(trading_data)
An object of class "data.frame"
Trade in Goods and Services. https://data.oecd.org/trade/trade-in-goods-and-services.htm. Accessed on 2023-05-29.
    data(trading_data)
    head(trading_data)
These data were collected for research on the effects of maternal mental health on prenatal movements in twins and singletons (Reissland et al., 2021). Two types of fetal touch movement were recorded: self-touch and twin-to-twin touch. The mothers’ mental health status was recorded on three variables: depression, perceived stress and anxiety. There are 14 pairs of twins; 11 of the mothers were available for one scan and 3 for two scans, giving 34 observations in total. This dataset contains only the twins data from the original study.
data(twins_data)
An object of class "data.frame"
Fetuses from the same twin pair share the same id number.
Frequency of self-touch for each fetus.
Frequency of twin-to-twin touch for each fetus.
Depression scale of the mothers.
Perceived Stress Scale of the mothers.
Hospital Anxiety of the mothers.
Reissland, N., Einbeck, J., Wood, R., and Lane, A. (2021). Effects of maternal mental health on prenatal movement profiles in twins and singletons. Acta Paediatrica, 110(9):2553–2558.
    data(twins_data)
    head(twins_data)