Title: Data Fusion using Optimal Transportation Theory
Description: In the context of data fusion, the package provides a set of functions dedicated to solving 'recoding problems' using optimal transportation theory (Gares, Guernec, Savy (2019) <doi:10.1515/ijb-2018-0106> and Gares, Omer (2020) <doi:10.1080/01621459.2020.1775615>). From two databases that share only a subset of common variables and have no overlapping rows, the functions of the package guide users through to a unique synthetic database in which the missing information is fully completed.
Authors: Gregory Guernec [aut, cre], Valerie Gares [aut], Pierre Navaro [ctb], Jeremy Omer [ctb], Philippe Saint-Pierre [ctb], Nicolas Savy [ctb]
Maintainer: Gregory Guernec <[email protected]>
License: GPL-3
Version: 0.1.2
Built: 2025-02-10 07:07:27 UTC
Source: CRAN
This database is a sample from the API program (https://www.cde.ca.gov/re/pr/api.asp), which ended in 2018. The sample is extracted from the api data of the survey package and relates to the results of county 29 (Nevada). The database contains information on the 418 schools of this county having at least 100 students. Missing values have been randomly (and deliberately) added to the awards and ell variables (4% and 7% respectively). Several variables have been deliberately categorized from their initial types.
api29
A data.frame with 418 schools (rows) and 12 variables
- the school identifier
- the API score in 2000, classed in 3 ordered levels: [200-600], (600-800], (800-1000]
- the school type, stored in a 3-level ordered factor: Elementary, Middle or High School
- whether the school is eligible for the awards program (two possible answers: No or Yes). This variable has 4% missing values.
- the number of core academic courses in the school
- the number of students tested in the school
- the average class size in years K-3 in the school, stored in a 3-level factor: Unknown, <=20, >20
- the percentage of parents with postgraduate education, stored in a 3-level ordered factor of percentages: 0, 1-10, >10
- the percentage of English language learners, stored in a 4-level ordered factor: [0-10], (10-30], (30-50], (50-100]. This variable has 7% missing values.
- the percentage of students for whom this is the first year at the school, stored in 2 levels: [0-20] and (20-100]
- the percentage of students eligible for subsidized meals, stored in a 4-level balanced factor (by quartiles): [0-25], (25-50], (50-75], (75-100]
- the percentage of fully qualified teachers, stored in a 2-level factor: 1 for strictly less than 90%, 2 otherwise
This database is a sample of the api data from the survey package.
This database is a sample from the API program (https://www.cde.ca.gov/re/pr/api.asp), which ended in 2018. The sample is extracted from the api data of the survey package and relates to the results of county 35 (San Benito). The database contains information on the 362 schools of this county having at least 100 students. Missing values have been randomly (and deliberately) added to the awards and ell variables (4% and 7% respectively). Several variables have been deliberately categorized from their initial types.
api35
A data.frame with 362 schools (rows) and 12 variables
- the school identifier
- the API score in 1999, classed in 4 ordered levels: G1, G2, G3, G4
- the school type, stored in a 3-level ordered factor: Elementary, Middle or High School
- whether the school is eligible for the awards program (two possible answers: No or Yes). This variable has 4% missing values.
- the number of core academic courses in the school
- the number of students tested in the school
- the average class size in years K-3 in the school, stored in a 3-level factor: Unknown, <=20, >20
- the percentage of parents with postgraduate education, stored in a 3-level ordered factor of percentages: 0, 1-10, >10
- the percentage of English language learners, stored in a 4-level ordered factor: [0-10], (10-30], (30-50], (50-100]. This variable has 7% missing values.
- the percentage of students for whom this is the first year at the school, stored in 2 levels: 1 and 2
- the percentage of students eligible for subsidized meals, stored in a 4-level balanced factor (by quartiles): [0-25], (25-50], (50-75], (75-100]
- the percentage of fully qualified teachers, stored in a 2-level factor: 1 for strictly less than 90%, 2 otherwise
This database is a sample of the api data from the survey package.
This function computes average distances between levels of two categorical variables located in two distinct databases.
avg_dist_closest(proxim, percent_closest = 1)
proxim: a proxim_dist object, i.e. the output of the function proxim_dist
percent_closest: a ratio between 0 and 1 corresponding to the desired proportion of rows (or statistical units, or individuals) that participate in the computation of the average distances between levels of factors, or between an individual (a row) and the levels of a single factor. Indeed, the target variables are factors, and each level of factor is characterized by a subset of rows, themselves characterized by their covariate profiles. These rows can be ordered according to their distances to their factor level. When this ratio is set to 1 (default setting), all rows participate in the computation; when this ratio is less than 1, only the rows with the smallest factor-level distances are kept for the computation (see 'Details').
The function avg_dist_closest is an intermediate function for the implementation of original algorithms dedicated to solving recoding problems in data fusion using Optimal Transportation theory (for more details, consult the corresponding algorithms called OUTCOME, R_OUTCOME, JOINT and R_JOINT in reference (2)). The function avg_dist_closest is therefore directly integrated in the functions OT_outcome and OT_joint, but can also be used separately.
The function avg_dist_closest uses, in particular, the distance matrix D (which stores the distances between the rows of A and B) from the function proxim_dist to produce three distinct matrices saved in a list object. The function therefore requires as input the specific output of the function proxim_dist, which is available in the package and must be run beforehand. Consequently, do not use this function directly on your database, and do not hesitate to consult the provided examples for a better understanding.
DEFINITION OF THE COST MATRIX
Assume that A and B are two databases with a set of shared variables, and that the same information (referring to the same target population) is stored as a variable Y in A and Z in B, such that Y is unknown in B and Z is unknown in A, and whose encoding depends on the database (nY levels in A and nZ levels in B).
A distance between one given level y of Y and one given level z of Z is estimated by averaging the distances between the two subsets of individuals (units or rows) assigned to y in A and z in B, characterized by their vectors of covariates. The distance between two individuals depends on the variations between the shared covariates, and so depends on the distance function chosen via the function proxim_dist. For these computations, all the individuals concerned by these two levels can be taken into account, or only a part of them, depending on the argument percent_closest. When percent_closest < 1, the average distance between an individual i and a given factor level z only uses the corresponding proportion of individuals related to z that are the closest to i. Therefore, this choice influences the estimations of the average distances between levels of factors, but also reduces the computation time when necessary.
The average distances between each individual of A (resp. B) and each level of Z (resp. Y) are returned in output, in the object DindivA (respectively DindivB). The average distances between each level of Y and each level of Z are returned in a matrix saved in output (the object Davg). Davg returns the computation of the cost matrix D, whose dimensions (nY x nZ) correspond to the number of levels of Y (rows) and Z (columns). This matrix can be seen as the ability for an individual (row) to move from a given level of the target variable Y in A to a given level of Z in database B (or vice versa).
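To fix ideas, the following minimal sketch reproduces this definition on toy data with percent_closest = 1 (so the average runs over all pairs). Here dbA, dbB and their variables are hypothetical illustrations, not objects or code of the package:

# Toy databases: one shared numeric covariate X, targets Y (in A) and Z (in B)
dbA <- data.frame(Y = factor(c("y1", "y1", "y2")), X = c(1, 2, 10))
dbB <- data.frame(Z = factor(c("z1", "z2")), X = c(1.5, 9))

# Manhattan distance between every row of A and every row of B
D <- abs(outer(dbA$X, dbB$X, "-"))

# Davg[y, z]: average distance between the rows of A at level y
# and the rows of B at level z (plain mean since percent_closest = 1)
Davg <- matrix(NA, nlevels(dbA$Y), nlevels(dbB$Z),
  dimnames = list(levels(dbA$Y), levels(dbB$Z))
)
for (y in levels(dbA$Y)) {
  for (z in levels(dbB$Z)) {
    Davg[y, z] <- mean(D[dbA$Y == y, dbB$Z == z, drop = FALSE])
  }
}
Davg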
A list of 3 matrices is returned:
Davg: the cost matrix, whose number of rows corresponds to nY, the number of levels of the target variable in database A, and whose number of columns corresponds to nZ, the number of levels of the target variable in database B
DindivA: a matrix whose number of rows corresponds to the number of rows of the first database A and whose number of columns corresponds to nZ, the number of levels of the target variable in the second database B. DindivA[i,z] refers to the average distance between the ith row of A and the rows of B assigned to the zth level of the target variable in B.
DindivB: a matrix whose number of rows corresponds to the number of rows of the second database B and whose number of columns corresponds to nY, the number of levels of the target variable in the first database A. DindivB[k,P] refers to the average distance between the kth row of B and the rows of A assigned to the Pth level of the target variable in A.
Gregory Guernec, Valerie Gares, Jeremy Omer
Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106
Gares V, Omer J (2020) Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615
data(simu_data)

### The covariates of the data are prepared according to the chosen distance
### using the transfo_dist function

### Example with the Manhattan distance
man1 <- transfo_dist(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7),
  ordinal = c(2, 6), logic = NULL, prep_choice = "M"
)
mat_man1 <- proxim_dist(man1, norm = "M")
# proxim_dist() fixes the chosen distance function,
# and defines neighborhoods between profiles and individuals

# The following line uses only 80 percent of the individuals of each level
# of factor for the computation of the average distances:
neig_man1 <- avg_dist_closest(mat_man1, percent_closest = 0.80)
This function compares the elements of two lists of the same length.
compare_lists(listA, listB)
listA: a first list
listB: a second list
A boolean vector of the same length as the two lists, whose ith element is TRUE if the ith elements of the two lists differ, and FALSE otherwise.
Gregory Guernec
data1 <- data.frame(Gender = rep(c("m", "f"), 5), Age = rnorm(5, 20, 4))
data2 <- data.frame(Gender = rep(c("m", "f"), 5), Age = rnorm(5, 21, 5))

list1 <- list(A = 1:4, B = as.factor(c("A", "B", "C")), C = matrix(1:6, ncol = 3))
list2 <- list(A = 1:4, B = as.factor(c("A", "B")), C = matrix(1:6, ncol = 3))
list3 <- list(A = 1:4, B = as.factor(c("A", "B", "C")), C = matrix(c(1:5, 7), ncol = 3))
list4 <- list(A = 1:4, B = as.factor(c("A", "B", "C")), C = matrix(1:6, ncol = 2))
list5 <- list(A = 1:4, B = as.factor(c("A", "B")), C = matrix(1:6, ncol = 2))
list6 <- list(A = 1:4, B = as.factor(c("A", "B")), C = data1)
list7 <- list(A = 1:4, B = as.factor(c("A", "B")), C = data2)

OTrecod::compare_lists(list1, list2)
OTrecod::compare_lists(list1, list3)
OTrecod::compare_lists(list1, list4)
OTrecod::compare_lists(list1, list5)
OTrecod::compare_lists(list6, list7)
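For intuition, the returned vector can be sketched in base R as follows; this is a hedged illustration of the comparison logic, not necessarily the package's actual implementation:

# Element i is TRUE when the ith components of the two lists differ
compare_sketch <- function(listA, listB) {
  !mapply(identical, listA, listB)
}
compare_sketch(list1, list4) # A and B match, C differs: FALSE FALSE TRUE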
This function studies the association between two categorical distributions with different numbers of modalities.
error_group(REF, Z, ord = TRUE)
REF: a factor with a reference number of levels
Z: a factor whose number of levels is greater than the number of levels of the reference
ord: a boolean. If TRUE, only neighboring (consecutive) levels of Z will be grouped and tested together (TRUE by default).
Assuming that REF and Z are two categorical variables summarizing the same information, and that one of the two related encodings is unknown to the user because it is, for example, the result of predictions provided by a given model or algorithm, the function error_group searches for potential groupings of the modalities of Z that best approach the distribution of the reference REF.
Assuming that REF and Z have nR and nZ modalities respectively, so that nZ > nR, in a first step the function error_group combines modalities of Z to build all the possible grouped variables Z' verifying nZ' = nR. In a second step, the association between REF and each generated variable Z' is measured by studying the ratio of concordant pairs related to the confusion matrix, but also using standard criteria: the Cramer's V (1), the Cohen's kappa coefficient (2) and the Spearman's rank correlation coefficient.
According to the type of Z, different combinations of modalities are tested:
- If REF and Z are ordinal (ord = TRUE), only consecutive modalities of Z will be grouped to build the variables Z'.
- If REF and Z are nominal (ord = FALSE), all combinations of modalities of Z (consecutive or not) will be grouped to build the variables Z'.
All the associations tested are listed in output as a data.frame object. The function error_group is directly integrated in the function verif_OT to evaluate the proximity of two multinomial distributions, when one of them is estimated from the predictions of an OT algorithm.
Example:
Assume that Z has nZ = 4 ordered modalities, that REF has nR = 3, and that the rank correlation between REF and Z is 0.89. Are there groupings of modalities of Z which contribute to improving the proximity between REF and Z? From Z, the function error_group answers this question by successively constructing the three variables Z1, Z2 and Z3 (obtained by grouping consecutive modalities of Z) and by testing the association of REF with each of them. Here, the tests permit to conclude that the difference of encodings between REF and Z resulted in fact from a simple grouping of modalities.
A data.frame with five columns:
combi: the first column, which enumerates all the possible groups of modalities of Z tested
error_rate: the second column, which gives the corresponding error rate from the confusion matrix (ratio of non-diagonal elements)
Kappa: the result of the Cohen's kappa coefficient related to each combination
Vcramer: the result of the Cramer's V criterion related to each combination
RankCor: the result of the Spearman's rank correlation coefficient related to each combination
Gregory Guernec
Cramér H (1946). Mathematical Methods of Statistics. Princeton: Princeton University Press.
McHugh ML (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22(3), 276–282.
# Basic examples:
sample1 <- as.factor(sample(1:3, 50, replace = TRUE))
length(sample1)
sample2 <- as.factor(sample(1:2, 50, replace = TRUE))
length(sample2)
sample3 <- as.factor(sample(c("A", "B", "C", "D"), 50, replace = TRUE))
length(sample3)
sample4 <- as.factor(sample(c("A", "B", "C", "D", "E"), 50, replace = TRUE))
length(sample4)

# By grouping only consecutive levels of sample4:
error_group(sample1, sample4)

# By grouping all possible levels of sample1, consecutive or not:
error_group(sample2, sample1, ord = FALSE)

### Using a sample of the tab_test object (3 complete covariates)
### Y1 and Y2 are a same variable encoded in 2 different forms in DB 1 and 2:
### (4 levels for Y1 and 3 levels for Y2)
data(tab_test)

# Example with n1 = n2 = 70 and only X1 and X2 as covariates
tab_test2 <- tab_test[c(1:70, 5001:5070), 1:5]

### An example of JOINT model (Manhattan distance)
# Suppose we want to impute the missing parts of Y1 in DB2 only ...
try1J <- OT_joint(tab_test2,
  nominal = c(1, 4:5), ordinal = c(2, 3),
  dist.choice = "M", which.DB = "B"
)

# Error rates between Y2 and the predictions of Y1 in the DB 2
# by grouping the levels of Y1:
error_group(try1J$DATA2_OT$Z, try1J$DATA2_OT$OTpred)
table(try1J$DATA2_OT$Z, try1J$DATA2_OT$OTpred)
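As a complement, the error_rate criterion reported in the output can be sketched by hand on toy vectors; the confusion-matrix computation below is an illustration, not the package's code:

# Error rate = ratio of non-diagonal elements of the confusion matrix
ref <- factor(c(1, 1, 2, 2, 3, 3))
pred <- factor(c(1, 2, 2, 2, 3, 3))
conf_mat <- table(ref, pred)
1 - sum(diag(conf_mat)) / sum(conf_mat) # here 1/6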
This function computes a distance matrix using the Hamming distance as the proximity measure.
ham(mat_1, mat_2)
mat_1: a vector, a matrix or a data.frame of binary values that may contain missing data
mat_2: a vector, a matrix or a data.frame of binary values with the same number of columns as mat_1, which may also contain missing data
ham returns the pairwise distances between the rows (observations) of a single matrix if mat_1 equals mat_2. Otherwise, ham returns the matrix of distances between the rows of the two distinct matrices mat_1 and mat_2.
Computing the Hamming distance remains possible despite the presence of missing data, by applying the following formula. Assume that A and B are two binary matrices such that ncol(A) = ncol(B) = P. The Hamming distance between the ith row of A and the jth row of B equals:

D(A_i, B_j) = sum_k 1{A_ik != B_jk} * (P / P_ij)

where the sum runs over the columns k for which A_ik and B_jk are both observed, and P_ij denotes the number of such columns; the right term of the multiplication corresponds to a specific weight applied in the presence of NAs in A_i and/or B_j.
This specificity is not implemented in the cdist function, and the Hamming distance cannot be computed using the dist function either.
The Hamming distance cannot be calculated in only two situations:
- A row of A or B has only missing values (i.e. in each of the columns of A or B respectively).
- The union of the indexes of the missing values in row i of A with the indexes of the missing values in row j of B covers all the considered columns. For instance, with P = 3, if A_i = (1, NA, 0) and B_j = (NA, 1, NA), then for each column either the information in row i is missing in A or the information is missing in B, which induces D(A_i, B_j) = NA.
If mat_1 is a vector and mat_2 is a matrix (or data.frame), or vice versa, the length of mat_1 must be equal to the number of columns of mat_2.
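The weighting formula above can be sketched in a few lines of base R for two single rows; this is a simplified illustration under the stated definition, not the package's vectorized implementation:

ham_row <- function(a, b) {
  obs <- !is.na(a) & !is.na(b) # columns where both values are observed
  if (!any(obs)) {
    return(NA) # situations where the distance is undefined
  }
  sum(a[obs] != b[obs]) * length(a) / sum(obs) # weight = P / P_ij
}
ham_row(c(1, 0, 1), c(0, 0, 1)) # no NAs: plain Hamming distance, 1
ham_row(c(1, NA, 0), c(0, 1, 0)) # one incomplete column: weight 3/2, 1.5
ham_row(c(1, NA, 0), c(NA, 1, NA)) # no complete column: NA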
A distance matrix
Gregory Guernec
Roth R (2006). Introduction to Coding Theory. Cambridge University Press.
set.seed(3010)
sample_A <- sample(c(0, 1), 12, replace = TRUE)
set.seed(3007)
sample_B <- sample(c(0, 1), 15, replace = TRUE)
A <- matrix(sample_A, ncol = 3)
B <- matrix(sample_B, ncol = 3)
# These 2 matrices have no missing values

# Matrix of pairwise distances with A:
ham(A, A)

# Matrix of distances between the rows of A and the rows of B:
ham(A, B)

# If mat_1 is a vector of binary values:
ham(c(0, 1, 0), B)

# Now consider A_NA and B_NA, two matrices built from A and B respectively,
# where missing values have been manually added:
A_NA <- A
A_NA[3, 1] <- NA
A_NA[2, 2:3] <- rep(NA, 2)
B_NA <- B
B_NA[2, 2] <- NA

ham(A_NA, B_NA)
This function performs imputations on incomplete covariates, whatever their types, using functions from the package mice (Van Buuren's multiple imputation by chained equations) or from the package missMDA (single imputation with multivariate data analysis).
imput_cov(
  dat1,
  indcol = 1:ncol(dat1),
  R_mice = 5,
  meth = rep("pmm", ncol(dat1)),
  missMDA = FALSE,
  NB_COMP = 3,
  seed_choice = sample(1:1e+06, 1)
)
dat1: a data.frame containing the variables to be imputed and those involved in the imputations
indcol: a vector of integers. The column indexes corresponding to the variables to be imputed and those involved in the imputations.
R_mice: an integer. The number of imputed databases generated with the MICE method (5 by default).
meth: a vector of characters that specifies the imputation method to be used for each column of dat1 ("pmm" for each column by default)
missMDA: a boolean. If TRUE, the missing values are imputed by single imputation using factor analysis for mixed data (FAMD) from the missMDA package; if FALSE (default), MICE is used.
NB_COMP: an integer corresponding to the number of components used in FAMD to predict the missing entries (3 by default), only used when missMDA = TRUE
seed_choice: an integer used as argument by set.seed() for offsetting the random number generator (random integer by default)
By default, the function imput_cov handles missing information using multivariate imputation by chained equations (MICE, see (1) for more details about the method) by integrating in its syntax the function mice. All arguments of this latter function keep their default values, except the required number of multiple imputations, which can be fixed using the argument R_mice, and the imputation method chosen for each variable (the meth argument, which corresponds to the argument defaultMethod of the function mice).
When multiple imputations are required (for MICE only), each missing value is imputed by a consensus value: the average of the candidate values is retained for numerical variables, while the most frequent class is retained for categorical variables (ordinal or not). The output MICE_IMPS stores the imputed databases, to allow users to build their own consensus values by themselves and/or to assess the variability related to the proposed imputed values if necessary. For this method, a random number generator must be fixed or sampled using the argument seed_choice.
When the argument missMDA is set to TRUE, incomplete values are replaced (single imputation) using a method based on dimensionality reduction called factor analysis for mixed data (FAMD), via the imputeFAMD function of the missMDA package (2). With this approach, the function imput_cov keeps all the default values integrated in the function imputeFAMD, except the number of dimensions used for FAMD, which can be fixed by users (3 by default).
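The consensus rule described above (average for numerical variables, majority class for factors) can be sketched as follows, where imps is a hypothetical list of R imputed versions of a single variable; this is an illustration of the rule, not the package's code:

consensus <- function(imps) {
  if (is.numeric(imps[[1]])) {
    Reduce("+", imps) / length(imps) # average of the candidate values
  } else {
    apply(sapply(imps, as.character), 1, function(v) {
      names(which.max(table(v))) # most frequent class
    })
  }
}
consensus(list(c(1, 2), c(3, 4))) # 2 3
consensus(list(factor(c("a", "b")), factor(c("a", "b")), factor(c("a", "a")))) # "a" "b"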
A list of 3 or 4 objects (depending on the missMDA argument): the first three objects below are returned when missMDA = TRUE, all four otherwise:
RAW: a data.frame corresponding to the raw database
IMPUTE: a character indicating the type of selected imputation
DATA_IMPUTE: a data.frame corresponding to the completed (consensus if multiple imputations) database
MICE_IMPS: only if missMDA = FALSE. A list object containing the R imputed databases generated by MICE
Gregory Guernec
van Buuren S, Groothuis-Oudshoorn K (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67. https://www.jstatsoft.org/v45/i03/
Josse J, Husson F (2016). missMDA: A Package for Handling Missing Values in Multivariate Data Analysis. Journal of Statistical Software, 70(1), 1–31. doi:10.18637/jss.v070.i01
# Imputation of all incomplete covariates in the table simu_data:
data(simu_data)
# Here we keep the complete variable "Gender" in the imputation model.

# Using MICE (R_mice = 3):
imput_mice <- imput_cov(simu_data,
  indcol = 4:8, R_mice = 3,
  meth = c("logreg", "polyreg", "polr", "logreg", "pmm")
)
summary(imput_mice)

# Using FAMD (NB_COMP = 3):
imput_famd <- imput_cov(simu_data,
  indcol = 4:8,
  meth = c("logreg", "polyreg", "polr", "logreg", "pmm"),
  missMDA = TRUE
)
summary(imput_famd)
This function sequentially assigns individual predictions using a nearest neighbor procedure to solve recoding problems in data fusion.
indiv_grp_closest(
  proxim,
  jointprobaA = NULL,
  jointprobaB = NULL,
  percent_closest = 1,
  which.DB = "BOTH"
)
proxim: a proxim_dist object, i.e. the output of the function proxim_dist
jointprobaA: a matrix whose number of rows corresponds to the number of modalities of the target variable Y in database A, and whose number of columns corresponds to the number of modalities of the target variable Z in database B. It gives an estimation of the joint distribution of (Y,Z) in A; the sum of its cells must equal 1 (NULL by default).
jointprobaB: a matrix whose number of rows corresponds to the number of modalities of the target variable Y in database A, and whose number of columns corresponds to the number of modalities of the target variable Z in database B. It gives an estimation of the joint distribution of (Y,Z) in B; the sum of its cells must equal 1 (NULL by default).
percent_closest: a value between 0 and 1 (by default) corresponding to the fixed percentage of the closest individuals taken into account in the computation of the average distances
which.DB: a character string (with quotes) that indicates which individual predictions need to be computed: only the individual predictions of Y in B ("B"), only those of Z in A ("A"), or both ("BOTH" by default)
A. THE RECODING PROBLEM IN DATA FUSION
Assume that Y and Z are two variables referring to the same target population in two separate databases A and B respectively (no overlapping rows), so that Y and Z are never jointly observed. Assume also that A and B share a subset of common covariates X of any types (with the same encodings in A and B), complete or not. Integrating these two databases often requires solving the recoding problem by creating a unique database where the missing information of Y and Z is fully completed.
B. DESCRIPTION OF THE FUNCTION
The function indiv_grp_closest is an intermediate function used in the implementation of an algorithm called OUTCOME (and its enrichment R-OUTCOME, see reference (2) for more details), dedicated to solving recoding problems in data fusion using Optimal Transportation theory. The model is implemented in the function OT_outcome, which integrates the function indiv_grp_closest in its syntax as a possible second step of the algorithm. The function indiv_grp_closest can also be used separately, provided that the argument proxim receives an output object of the function proxim_dist. This latter function is available in the package and must therefore be run beforehand.
The algorithms OUTCOME (and R-OUTCOME) are made of two independent parts. Assuming that the objective consists in the prediction of Z in database A:
- The first part of the algorithm solves the optimization problem, providing a solution gamma that corresponds here to an estimation of the joint distribution of (Y,Z) in A.
- From this first part, a nearest neighbor procedure is carried out as a second part to provide the individual predictions of Z in A: this procedure is implemented in the function indiv_grp_closest. In other words, this function sequentially assigns to each individual of A the modality of Z that is closest.
Obviously, this algorithm runs in the same way for the prediction of Y in database B.
The function indiv_grp_closest integrates in its syntax the function avg_dist_closest; therefore, the related argument percent_closest is identical in the two functions. Thus, when the computation of average distances between an individual i and a subset of individuals assigned to a same level of Y or Z is required, the user can decide whether all individuals from the subset of interest participate in the computation (percent_closest = 1) or only a fixed proportion p (< 1) corresponding to the closest neighbors of i (in this case percent_closest = p).
The arguments jointprobaA and jointprobaB correspond to estimations of the joint distribution of (Y,Z) (the sum of cells must equal 1) in A and/or B respectively, according to the which.DB argument.
For example, assume that n1 individuals are assigned to the first modality of Y in A and that the objective consists in the individual predictions of Z in A. Then, if jointprobaA[1,2] = 0.10, the number of individuals that can be assigned to the second modality of Z in A cannot exceed 10% of the rows of A. If n1 does not exceed this threshold, then all the individuals assigned to the first modality of Y will be assigned to the second modality of Z. At the end of the process, each individual with still no affectation receives the same modality of Z as that of their nearest neighbor in B.
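As a rough numeric illustration of this rule (all values below are hypothetical), each cell of the estimated joint distribution can be read as a maximum head count:

# A cell jointprobaA[y, z] = p allows at most about p * nA individuals
# of database A to move from level y of Y to level z of Z
nA <- 300 # hypothetical number of rows in A
jointprobaA <- matrix(c(0.25, 0.10,
                        0.40, 0.25), nrow = 2, byrow = TRUE)
round(jointprobaA * nA) # e.g. at most 30 individuals with Y = 1 receive Z = 2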
A list of two vectors of numeric values:
YAtrans: a vector corresponding to the individual predictions of Y in database B (NULL if which.DB = "A")
ZBtrans: a vector corresponding to the individual predictions of Z in database A (NULL if which.DB = "B")
Gregory Guernec, Valerie Gares, Jeremy Omer
Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106
Gares V, Omer J (2020) Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615
proxim_dist, avg_dist_closest, OT_outcome
data(simu_data)

### Example with the Manhattan distance
man1 <- transfo_dist(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7),
  ordinal = c(2, 6), logic = NULL, prep_choice = "M"
)
mat_man1 <- proxim_dist(man1, norm = "M")

### Y(Yb1) and Z(Yb2) are a same information encoded in 2 different forms:
### (3 levels for Y and 5 levels for Z)
### ... stored in two distinct databases, A and B, respectively
### The marginal distribution of Y in B is unknown,
### as is the marginal distribution of Z in A ...

# Empirical distribution of Y in database A:
freqY <- prop.table(table(man1$Y))
freqY

# Empirical distribution of Z in database B:
freqZ <- prop.table(table(man1$Z))
freqZ

# Suppose that the following matrix, called transport1, symbolizes
# an estimation of the joint distribution L(Y,Z) ...
# Note that, in reality, this distribution is UNKNOWN and is
# estimated in the OT function by solving an optimization problem.
transport1 <- matrix(c(0.3625, 0, 0, 0.07083333, 0.05666667, 0, 0,
                       0.0875, 0, 0, 0.1075, 0, 0, 0.17166667, 0.1433333),
  ncol = 5, byrow = FALSE
)

# ... so that the marginal distributions of this object correspond to freqY and freqZ:
apply(transport1, 1, sum) # = freqY
apply(transport1, 2, sum) # = freqZ

# The affectation of the predicted values of Y in database B and Z in database A
# are stored in the following object:
pred_man1 <- indiv_grp_closest(mat_man1,
  jointprobaA = transport1, jointprobaB = transport1,
  percent_closest = 0.90
)
summary(pred_man1)

# For the prediction of Z in A only, add the corresponding argument:
pred_man1_A <- indiv_grp_closest(mat_man1,
  jointprobaA = transport1, jointprobaB = transport1,
  percent_closest = 0.90, which.DB = "A"
)
This function assigns individual predictions to the incomplete information of two integrated data sources by solving a linear optimization problem.
indiv_grp_optimal(
  proxim,
  jointprobaA,
  jointprobaB,
  percent_closest = 1,
  solvr = "glpk",
  which.DB = "BOTH"
)
proxim: a proxim_dist object, i.e. the output of the function proxim_dist
jointprobaA: a matrix whose number of rows is equal to the number of modalities of the target variable Y in database A, and whose number of columns is equal to the number of modalities of the target variable Z in database B. It gives an estimation of the joint distribution of (Y,Z) in A; the sum of its cells must equal 1.
jointprobaB: a matrix whose number of rows is equal to the number of modalities of the target variable Y in database A, and whose number of columns is equal to the number of modalities of the target variable Z in database B. It gives an estimation of the joint distribution of (Y,Z) in B; the sum of its cells must equal 1.
percent_closest: a value between 0 and 1 (by default) corresponding to the fixed percentage of the closest individuals taken into account in the computation of the average distances
solvr: a character string that specifies the method selected to solve the optimization algorithms. The default solver is "glpk".
which.DB: a character string that indicates which individual predictions are computed: only the individual predictions of Y in B ("B"), only those of Z in A ("A"), or both ("BOTH" by default)
A. THE RECODING PROBLEM IN DATA FUSION
Assume that Y and Z are two target variables referring to the same target population in two separate databases A and B respectively (no overlapping rows), so that Y and Z are never jointly observed. Assume also that A and B share a subset of common covariates X of any types (with the same encodings in A and B), complete or not. Merging these two databases often requires solving a recoding problem by creating a unique database where the missing information of Y and Z is fully completed.
B. DESCRIPTION OF THE FUNCTION
The function indiv_grp_optimal is an intermediate function used in the implementation of an algorithm called OUTCOME (and its enrichment R-OUTCOME (2)), dedicated to solving recoding problems in data fusion using Optimal Transportation theory. The model is implemented in the function OT_outcome, which integrates the function indiv_grp_optimal in its syntax as a possible second step of the algorithm. The function indiv_grp_optimal can nevertheless be used separately, provided that the argument proxim receives an output object of the function proxim_dist. This latter function is available in the package and must therefore be run beforehand.
The function indiv_grp_optimal constitutes an alternative method to the nearest neighbor procedure implemented in the function indiv_grp_closest. As for the function indiv_grp_closest, assuming that the objective consists in the prediction of Z in database A, the first step of the algorithm related to OUTCOME provides an estimate of gamma, the solution of the optimization problem, which can be seen, in this case, as an estimation of the joint distribution of (Y,Z) in A. Rather than using a nearest neighbor approach to provide individual predictions, the function indiv_grp_optimal solves an optimization problem using the simplex algorithm, which searches for the individual predictions of Z that minimize the computed total distance while satisfying the joint probability distribution estimated in the first part. More details about the theory related to the solving of this optimization problem are described in section 5.3 of (2). Obviously, this algorithm runs in the same way for the prediction of Y in database B.
The function indiv_grp_optimal integrates in its syntax the function avg_dist_closest, and the related argument percent_closest is identical in the two functions. Thus, when the computation of average distances between an individual i and a subset of individuals assigned to a same level of Y or Z is required, the user can decide whether all individuals from the subset of interest participate in the computation (percent_closest = 1) or only a fixed proportion p (< 1) corresponding to the closest neighbors of i (in this case percent_closest = p).
The arguments jointprobaA and jointprobaB can be seen as estimations of the joint distribution of (Y,Z) (the sum of cells must equal 1) in A and B respectively.
The argument solvr permits the user to choose the solver of the optimization algorithm. The default solver "glpk" corresponds to the GNU Linear Programming Kit (see (3) for more details). The solver "clp" (see (4)), for Coin-or Linear Programming, convenient in linear and quadratic situations, is also directly integrated in the function. Moreover, the function actually uses the R optimization infrastructure of the package ROI, which offers a wide choice of solvers to users by simply loading the associated ROI plugins (see (5)).
A list of two vectors of numeric values:
YAtrans: a vector corresponding to the predicted values of Y in database B (NULL if which.DB = "A")
ZBtrans: a vector corresponding to the predicted values of Z in database A (NULL if which.DB = "B")
Gregory Guernec, Valerie Gares, Jeremy Omer
Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106
Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615
Makhorin A (2011). GNU Linear Programming Kit Reference Manual Version 4.47. http://www.gnu.org/software/glpk/
Forrest J, de la Nuez D, Lougee-Heimer R (2004). Clp User Guide. https://www.coin-or.org/Clp/userguide/index.html
Theussl S, Schwendinger F, Hornik K (2020). ROI: An Extensible R Optimization Infrastructure. Journal of Statistical Software, 94(15), 1-64. doi:10.18637/jss.v094.i15
proxim_dist, avg_dist_closest, indiv_grp_closest
### Example using the Manhattan distance on a complete database
# For this example we keep only 160 rows:
data(tab_test)
tab_test2 <- tab_test[c(1:80, 5001:5080), ]
dim(tab_test2)

# Adding NAs in Y1 and Y2:
tab_test2[tab_test2$ident == 2, 2] <- NA
tab_test2[tab_test2$ident == 1, 3] <- NA

# Because all covariates are ordered in numeric form,
# the transfo_dist function is not required here:
mat_testm <- proxim_dist(tab_test2, norm = "M")

### Y(Y1) and Z(Y2) are a same variable encoded in 2 different forms:
### 4 levels for Y1 and 3 levels for Y2
### ... stored in two distinct databases, A and B, respectively
### The marginal distribution of Y in B is unknown,
### as is the marginal distribution of Z in A ...

# Assuming that the following matrix symbolizes
# an estimation of the joint distribution L(Y,Z) ...
# Note that, in reality, this distribution is UNKNOWN and is
# estimated in the OT function by solving the optimization problem.
val_trans <- c(0.275, 0.115, 0, 0, 0, 0.085, 0.165, 0, 0, 0, 0.095, 0.265)
mat_trans <- matrix(val_trans, ncol = 3, byrow = FALSE)

# Getting the individual predictions of Z in A (only)
# by computing average distances on 90% of the nearest neighbors of
# each modality of Z in B:
predopt_A <- indiv_grp_optimal(mat_testm,
  jointprobaA = mat_trans, jointprobaB = mat_trans,
  percent_closest = 0.90, which.DB = "A"
)

### Example 2 using the Manhattan distance with incomplete covariates
data(simu_data)
man1 <- transfo_dist(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7),
  ordinal = c(2, 6), logic = NULL, prep_choice = "M"
)
mat_man1 <- proxim_dist(man1, norm = "M")

### Y and Z are a same variable encoded in 2 different forms:
### (3 levels for Y and 5 levels for Z)
### ... stored in two distinct databases, A and B, respectively
### The marginal distribution of Y in B is unknown,
### as is the marginal distribution of Z in A ...

# Assuming that the following matrix symbolizes
# an estimation of the joint distribution L(Y,Z) ...
mat_trans2 <- matrix(c(0.3625, 0, 0, 0.07083333, 0.05666667, 0, 0,
                       0.0875, 0, 0, 0.1075, 0, 0, 0.17166667, 0.1433333),
  ncol = 5, byrow = FALSE
)

# The predicted values of Y in database B and Z in
# database A are stored in the following object:
predopt2 <- indiv_grp_optimal(mat_man1,
  jointprobaA = mat_trans2, jointprobaB = mat_trans2,
  percent_closest = 0.90
)
summary(predopt2)
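As a usage note, switching the second example above to the Coin-or Clp solver only requires changing the solvr argument; this assumes that the corresponding ROI plugin (the ROI.plugin.clp package) is installed:

# Same call as predopt2 above, but solved with Coin-or Clp
# (assumes the ROI.plugin.clp package is installed):
predopt2_clp <- indiv_grp_optimal(mat_man1,
  jointprobaA = mat_trans2, jointprobaB = mat_trans2,
  percent_closest = 0.90, solvr = "clp"
)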
Harmonization and merging before data fusion of two databases with specific outcome variables and shared covariates.
merge_dbs(
  DB1,
  DB2,
  row_ID1 = NULL,
  row_ID2 = NULL,
  NAME_Y,
  NAME_Z,
  order_levels_Y = levels(DB1[, NAME_Y]),
  order_levels_Z = levels(DB2[, NAME_Z]),
  ordinal_DB1 = NULL,
  ordinal_DB2 = NULL,
  impute = "NO",
  R_MICE = 5,
  NCP_FAMD = 3,
  seed_choice = sample(1:1e+06, 1)
)
DB1: a data.frame corresponding to the 1st database to merge (top database)
DB2: a data.frame corresponding to the 2nd database to merge (bottom database)
row_ID1: the column index of the row identifier of DB1, if it exists (no identifier by default)
row_ID2: the column index of the row identifier of DB2, if it exists (no identifier by default)
NAME_Y: the name of the outcome (with quotes), in its specific scale/encoding, from the 1st database (DB1)
NAME_Z: the name of the outcome (with quotes), in its specific scale/encoding, from the 2nd database (DB2)
order_levels_Y: the levels of the outcome of DB1, sorted in a given order (by default, the existing order of the levels of NAME_Y in DB1)
order_levels_Z: the levels of the outcome of DB2, sorted in a given order (by default, the existing order of the levels of NAME_Z in DB2)
ordinal_DB1: a vector of column indexes corresponding to ordinal variables in the 1st database (no ordinal variable by default)
ordinal_DB2: a vector of column indexes corresponding to ordinal variables in the 2nd database (no ordinal variable by default)
impute: a character equal to "NO" when missing data on covariates are kept (default option), "CC" for complete case, by keeping only the covariates with no missing information, "MICE" for the MICE multiple imputation approach, or "FAMD" for a single imputation approach using factor analysis for mixed data
R_MICE: the number of multiple imputations required for the MICE approach (5 by default)
NCP_FAMD: an integer corresponding to the number of components used to predict missing values in FAMD imputation (3 by default)
seed_choice: an integer used as argument by set.seed() for offsetting the random number generator (random integer by default, only useful with MICE)
Assuming that DB1 and DB2 are two databases (two separate data.frames with no overlapping rows) to be merged vertically before data fusion, the function merge_dbs performs this merging and checks the harmonization of the shared variables.
Firstly, the two databases declared as input to the function (via the arguments DB1 and DB2) must have the same specific structure. Each database must contain a target variable (whose label must be filled in the argument NAME_Y for DB1 and NAME_Z for DB2, so that the final synthetic database in output will contain an incomplete variable Y whose corresponding values will be missing in DB2, and another incomplete target Z whose values will be missing in DB1), and a subset of shared covariates (for example, the best predictors of Y in DB1 and of Z in DB2).
Each database can have a row identifier, whose label must be assigned in the argument row_ID1 for DB1 and row_ID2 for DB2. Nevertheless, by default, DB1 and DB2 are supposed to have no row identifiers. The merging keeps the order of rows in the two databases unchanged, provided that Y and Z have no missing values. By construction, the first declared database (in the argument DB1) will be placed automatically above the second one (declared in the argument DB2) in the final database.
By default, a variable with the same name in the two databases is naively considered as shared. This condition is obviously insufficient for a variable to be kept in the final subset of shared variables, and the function merge_dbs therefore performs the checks described below before merging.
A. Discrepancies between shared variables
- Shared variables with discrepancies of types between the two databases (for example, a variable with a common name in the two databases but stored as numeric in DB1 and as character in DB2) will be removed from the merging, and the variable names will be saved in output (REMOVE1).
- Shared factors with discrepancies of levels (or of numbers of levels) will also be removed from the merging, and the variable names will be saved in output (REMOVE2).
- Covariates whose names are specific to each database will also be excluded from the merging.
- If some important predictors have been improperly excluded from the merging due to the above-mentioned checks, users can transform these variables a posteriori and re-run the function. A sketch of these checks is given after this list.
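The following minimal sketch illustrates the spirit of these checks on two hypothetical toy data.frames (DB1, DB2 and their columns are illustrative assumptions, not the package's implementation):

# Toy example: "age" has incompatible types, "grp" is fully compatible
DB1 <- data.frame(Y = factor(c("a", "b")), age = c(20, 30), grp = factor(c("x", "y")))
DB2 <- data.frame(Z = factor(c("c", "d")), age = c("20", "31"), grp = factor(c("x", "y")))

shared <- intersect(names(DB1), names(DB2)) # names naively considered as shared

same_type <- sapply(shared, function(v) identical(class(DB1[[v]]), class(DB2[[v]])))
same_levels <- sapply(shared, function(v) {
  !is.factor(DB1[[v]]) || identical(levels(DB1[[v]]), levels(DB2[[v]]))
})

shared[same_type & same_levels] # candidate covariates kept for the fusion: "grp"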
B. Rules for the two outcomes (target variables)
The types of Y and Z must be suitable:
- Categorical (ordered or not) factors are allowed.
- Numeric and discrete outcomes with a finite number of values are allowed, but will be automatically converted to ordered factors using the function transfo_target integrated in the function merge_dbs.
C. The function merge_dbs handles incomplete information in the shared variables by respecting the following rules:
- If Y or Z have missing values in DB1 or DB2, the corresponding rows are excluded from the database before merging. Moreover, in the case of incomplete outcomes, if DB1 and DB2 have row identifiers, the corresponding identifiers are removed and stored in the objects ID1_drop and ID2_drop of the output.
- Before overlaying the two databases, the function deals with incomplete covariates according to the argument impute. Users can decide to work with complete cases only ("CC"), to keep ("NO") or to impute incomplete information ("MICE", "FAMD"). The function imput_cov, integrated in the syntax of merge_dbs, deals with the imputations. Two approaches are currently available: the multivariate imputation by chained equations approach (MICE, see (3) for more details about the approach, or the corresponding package mice), and an imputation approach from the package missMDA that uses a dimensionality reduction method (here a factor analysis for mixed data called FAMD (4)) to provide single imputations.
- If multiple imputation is required (impute = "MICE"), the default imputation methods are applied according to the type of the variables. The average of the plausible values is kept for a continuous variable, while the most frequent candidate is kept as a consensus value for a categorical variable or factor (ordinal or not).
As a final step, the function checks that all values of Y are missing in B and, inversely, that all values of Z are missing in A.
A list containing 12 elements (13 when impute equals "MICE"):
DB_READY: the database matched from the two initial databases, with the common covariates, imputed or not according to the impute option
ID1_drop: the row numbers or row identifiers excluded from the data merging because of the presence of missing values in the target variable of DB1; NULL otherwise
ID2_drop: the row numbers or row identifiers excluded from the data merging because of the presence of missing values in the target variable of DB2; NULL otherwise
Y_LEVELS: the remaining levels of the target variable of DB1
Z_LEVELS: the remaining levels of the target variable of DB2
REMOVE1: the labels of the covariates deleted because of type incompatibilities between DB1 and DB2
REMOVE2: the factor(s) removed because of level incompatibilities between DB1 and DB2
REMAINING_VAR: the labels of the covariates retained for data fusion
IMPUTE_TYPE: a character with quotes that specifies the method eventually chosen to handle missing data in the covariates
MICE_DETAILS: a list containing the details of the imputed datasets using MICE (only when this option is chosen)
DB1_raw: a data.frame corresponding to DB1 after merging
DB2_raw: a data.frame corresponding to DB2 after merging
SEED: an integer used as argument by the set.seed function for offsetting the random number generator (random integer by default)
Gregory Guernec
Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106
Gares V, Omer J (2020) Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615
van Buuren S, Groothuis-Oudshoorn K (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67. https://www.jstatsoft.org/v45/i03/
Josse J, Husson F (2016). missMDA: A Package for Handling Missing Values in Multivariate Data Analysis. Journal of Statistical Software, 70(1), 1–31. doi:10.18637/jss.v070.i01
imput_cov, transfo_target, select_pred
### Assume two distinct databases from simu_data: data_A and data_B
### Some transformations are made beforehand on the variables to generate
### heterogeneities between the two bases.
data(simu_data)
data_A <- simu_data[simu_data$DB == "A", c(2, 4:8)]
data_B <- simu_data[simu_data$DB == "B", c(3, 4:8)]

# For the example, a covariate (Weight) is added only in data_A:
data_A$Weight <- rnorm(300, 70, 5)

# Be careful: the target variables must be factors (or ordered) in the 2 databases.
# Because it is not the case for Yb2 in data_B, the function will convert it.
data_B$Yb2 <- as.factor(data_B$Yb2)

# Moreover, the Dosage covariate is stored in 3 classes in data_B (instead of
# 4 classes in data_A) to make the encoding of this covariate specific to each database:
data_B$Dosage <- as.character(data_B$Dosage)
data_B$Dosage <- as.factor(ifelse(data_B$Dosage %in% c("Dos 1", "Dos 2"), "D1",
  ifelse(data_B$Dosage == "Dos 3", "D3", "D4")
))

# For more diversity, this covariate is placed in the last column of data_B:
data_B <- data_B[, c(1:3, 5, 6, 4)]

# Ex 1: The two databases are merged and incomplete covariates are imputed using MICE
merged_ex1 <- merge_dbs(data_A, data_B,
  NAME_Y = "Yb1", NAME_Z = "Yb2",
  ordinal_DB1 = c(1, 4), ordinal_DB2 = c(1, 6),
  impute = "MICE", R_MICE = 2, seed_choice = 3011
)
summary(merged_ex1$DB_READY)

# Ex 2: The two databases are merged and missing values are kept
merged_ex2 <- merge_dbs(data_A, data_B,
  NAME_Y = "Yb1", NAME_Z = "Yb2",
  ordinal_DB1 = c(1, 4), ordinal_DB2 = c(1, 6),
  impute = "NO", seed_choice = 3011
)

# Ex 3: The two databases are merged by only keeping the complete cases
merged_ex3 <- merge_dbs(data_A, data_B,
  NAME_Y = "Yb1", NAME_Z = "Yb2",
  ordinal_DB1 = c(1, 4), ordinal_DB2 = c(1, 6),
  impute = "CC", seed_choice = 3011
)

# Ex 4: The two databases are merged and incomplete covariates are imputed using FAMD
merged_ex4 <- merge_dbs(data_A, data_B,
  NAME_Y = "Yb1", NAME_Z = "Yb2",
  ordinal_DB1 = c(1, 4), ordinal_DB2 = c(1, 6),
  impute = "FAMD", NCP_FAMD = 4, seed_choice = 2096
)

# Conclusion:
# The data fusion is successful in each situation.
# The Dosage and Weight covariates have been normally excluded from the fusion.
# The covariates have been imputed when required.
This database is a sample of the first four waves of data collection of the National Child Development Study (NCDS) started in 1958 (https://cls.ucl.ac.uk/cls-studies/1958-national-child-development-study/). The NCDS project is a continuing survey which follows the lives of over 17,000 people born in England, Scotland and Wales in the same week of 1958.
ncds_14
ncds_14
A data.frame with 5,476 participants (rows) and 6 variables
the anonymised ncds identifier
the Goldthorpe social class 90 scale coded as a 12-levels factor: higher-grade professionals 10, lower-grade professionals 20, routine non-manual employees with higher grade (administration, commerce) 31, routine non-manual employees with lower grade (sales and services) 32, small proprietors with employees 41, small proprietors without employees 42, farmers, small holders and workers in primary production 43, lower-grade technicians 50, skilled manual workers 60, semi-skilled and unskilled manual workers 71, other workers in primary production 72, and 0 when the scale was not applicable to the participant. This variable has 806 NAs.
the health status of the participant stored in a 4 ordered levels factor: 1 for excellent, 2 for good, 3 for fair, 4 for poor. This variable has 2 NAs.
the employment status at inclusion stored in a 7-levels factor: 1 for unemployed, 2 for government scheme, 3 for full-time education, 4 for housework or childcare, 5 for sick or handicapped, 6 for other, 7 if employed between 16 and 33. This variable has 58 NAs.
the gender of the participant stored in a 2-levels factor: 1 for male, 2 for female
a 2-levels factor equal to 1 for participants with completed graduate studies, 2 otherwise
The ncds identifiers have been voluntarily anonymized to allow their availability in the package.
This sample has 5,476 participants included in the study between the first and fourth wave of data collection.
INSERM - This database is a sample of the National Child Development Study
This database is a sample of the fifth wave of data collection of the National Child Development Study (NCDS) started in 1958 (https://cls.ucl.ac.uk/cls-studies/1958-national-child-development-study/). The NCDS project is a continuing survey which follows the lives of over 17,000 people born in England, Scotland and Wales in the same week of 1958.
ncds_5
ncds_5
A data.frame with 365 participants (rows) and 6 variables
the anonymised ncds identifier
the gender of the participant stored in a 2-levels factor: 1 for male, 2 for female
the RG social class 91 scale coded as a 7-levels factor: 10 for professional occupations, 20 for managerial and technical occupations, 31 for skilled non-manual occupations, 32 for skilled manual occupations, 40 for partly-skilled occupations, 50 for unskilled occupations, and 0 when the scale was not applicable to the participant. This variable is complete.
the health status of the participant stored in a 4 ordered levels factor: 1 for excellent, 2 for good, 3 for fair, 4 for poor. This variable has 2 NAs.
the employment status at inclusion stored in a 7-levels factor: 1 for unemployed, 2 for government scheme, 3 for full-time education, 4 for housework or childcare, 5 for sick or handicapped, 6 for other, 7 if employed between 16 and 33. This variable has 58 NAs.
a 2-levels factor equal to 1 for participants with completed graduate studies, 2 otherwise
The ncds identifiers have been voluntarily anonymized to allow their availability in the package.
This sample has 365 participants included in the study during the fifth wave of data collection.
INSERM - This database is a sample of the National Child Development Study
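As a quick orientation, both NCDS samples can be loaded and inspected with base R only (a minimal sketch; the dimensions in the comments simply restate the counts given above):

data(ncds_14)
data(ncds_5)
dim(ncds_14)  # 5,476 participants, 6 variables
dim(ncds_5)   # 365 participants, 6 variables
summary(ncds_14)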
The function OT_joint integrates two algorithms called JOINT and R-JOINT dedicated to solving recoding problems in data fusion, using optimal transportation of the joint distribution of outcomes and covariates.
OT_joint( datab, index_DB_Y_Z = 1:3, nominal = NULL, ordinal = NULL, logic = NULL, convert.num = NULL, convert.class = NULL, dist.choice = "E", percent.knn = 1, maxrelax = 0, lambda.reg = 0, prox.X = 0.3, solvR = "glpk", which.DB = "BOTH" )
datab | a data.frame made up of two overlayed databases with at least four columns sorted in a random order. One column must be dedicated to the identification of the two databases, ranked in ascending order (for example: 1 for the top database and 2 for the database from below, or more logically here A and B, but not B and A!). One column (Y here, but other names are allowed) must correspond to the target variable related to the information of interest to merge, with its specific encoding in database A (corresponding encoding missing in database B). In the same way, one column (Z here) corresponds to the second target variable with its specific encoding in database B (missing encoding in A). Finally, the input database must have at least one shared covariate with the same encoding in A and B.
index_DB_Y_Z | a vector of three indexes of variables. The first index must correspond to the index of the databases identifier column. The second index corresponds to the index of the target variable in the first database (A), while the third index corresponds to the column index related to the target variable in the second database (B).
nominal | a vector of column indexes of all the nominal (not ordered) variables (database identifier and target variables included, if it is the case for them).
ordinal | a vector of column indexes of all the ordinal variables (database identifier and target variables included, if it is the case for them).
logic | a vector of column indexes of all the boolean variables of the data.frame.
convert.num | indexes of the continuous (quantitative) variables. They will be automatically converted in ordered factors. By default, no continuous variable is assumed in the database.
convert.class | a vector indicating, for each continuous variable to convert, the corresponding desired number of levels. If the length of the argument convert.num exceeds 1 while the length of convert.class equals 1, each discretization will count the same number of levels (quantiles).
dist.choice | a character string (with quotes) corresponding to the distance function chosen between: the Euclidean distance ("E", by default), the Manhattan distance ("M"), the Gower distance ("G"), and the Hamming distance ("H") for binary covariates only.
percent.knn | the ratio of closest neighbors involved in the computations of the cost matrices. 1 is the default value that includes all rows in the computation.
maxrelax | the maximum percentage of deviation from expected probability masses. It must be equal to 0 (default value) for the JOINT algorithm, and strictly positive for the R-JOINT algorithm.
lambda.reg | a coefficient measuring the importance of the regularization term. It corresponds to the R-JOINT algorithm for a value other than 0 (0 is the default value, corresponding to no regularization).
prox.X | a probability (between 0 and 1) used to calculate the distance threshold below which two profiles of covariates are supposed to be neighbors.
solvR | a character string that specifies the type of method selected to solve the optimization algorithms. The default solver is "glpk".
which.DB | a character string indicating the database to complete ("BOTH" by default, for the prediction of Y and Z in the two databases; "A" for the prediction of Z in database A only; "B" for the prediction of Y in database B only).
A. THE RECODING PROBLEM IN DATA FUSION

Assuming that Y and Z are two target variables which refer to the same target population in two separate databases A and B respectively (no overlapping rows), so that Y and Z are never jointly observed. Assuming also that A and B share a subset of common covariates X of any types (same encodings in A and B), complete or not. Merging these two databases often requires solving a recoding problem by creating a unique database where the missing information of Y and Z is fully completed.
B. INFORMATION ABOUT THE ALGORITHM

As with the function OT_outcome, the function OT_joint provides a solution to the recoding problem by proposing an application of optimal transportation which aims to search for a bijective mapping between the joint distributions of (Y,X) and (Z,X) in A and B (see (2) for more details). The principle of the algorithm is also based on the resolution of an optimization problem, which provides a solution γ (as called in (1) and (2)), an estimate of the joint distribution of (Y,Z) according to the database to complete (see the argument which.DB for the choice of the database). While the algorithms OUTCOME and R_OUTCOME integrated in the function OT_outcome require post-treatment steps to provide individual predictions, the algorithm JOINT directly uses estimations of the conditional distributions of (Y|Z,X) in B and (Z|Y,X) in A to predict the corresponding incomplete individual information of Y and/or Z respectively.

This algorithm supposes that the conditional distribution (Y|X) must be identical in A and B. Respectively, (Z|X) is supposed identical in A and B. A posteriori estimations of the conditional probabilities P[Y|X,Z] and P[Z|X,Y] are available for each profile of covariates in output (see the objects estimatorYB and estimatorZA). Estimations of γ are also available according to the chosen transport distributions (see the outputs gamma_A and gamma_B).

The algorithm R-JOINT gathers enrichments of the algorithm JOINT and is also available via the function OT_joint. It allows users to add a relaxation term to relax distributional assumptions (maxrelax > 0), and (or) a positive regularization term (lambda.reg > 0) expressing that the transportation map does not vary too quickly with respect to the covariates X. It is suggested to calibrate these two parameters a posteriori by studying the stability of the individual predictions in output (a possible calibration sketch follows).
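As an illustration of this a posteriori calibration, the sketch below fits OT_joint over a small grid of relaxation values and compares the completed versions of B (the grid values are arbitrary, and tab_test2 is built as in the examples further down):

data(tab_test)
tab_test2 <- tab_test[c(1:40, 5001:5040), 1:5]

# Fit the model for several candidate relaxation values
fits <- lapply(c(0, 0.2, 0.4), function(mr) {
  OT_joint(tab_test2,
    nominal = c(1, 4:5), ordinal = c(2, 3),
    dist.choice = "G", maxrelax = mr, which.DB = "B"
  )
})

# Compare the completed databases B across the grid: stable individual
# predictions argue for a robust choice of maxrelax
lapply(fits, function(fit) head(fit$DATA2_OT))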
C. EXPECTED STRUCTURE FOR THE INPUT DATABASE

The input database is a data.frame that must satisfy a specific form:

Two overlayed databases containing a common column of database identifiers (A and B, 1 or 2, for example, encoded in numeric or factor form)

A column corresponding to the target variable with its specific encoding in A (for example a factor encoded in nY levels, ordered or not, with NAs in the corresponding rows of B)

A column corresponding to another target outcome summarizing the same latent information with its specific encoding in B (for example a factor with nZ levels, with NAs in the rows of A)

The order of the variables in the database has no importance but the column indexes related to the three columns previously described (i.e. ID, Y and Z) must be rigorously specified in the argument index_DB_Y_Z.

A set of shared categorical covariates (at least one, but more is recommended) with or without missing values (provided that the number of covariates exceeds 1) is required. Contrary to the function OT_outcome, please notice that the function OT_joint does not accept continuous covariates; these will therefore have to be categorized beforehand or via the dedicated input process (see convert.num). A quick structural check is sketched below. The function merge_dbs is available in this package to assist users in the preparation of their databases.
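For instance, the table simu_data supplied with the package already follows this structure, which can be verified with base R only (a minimal sketch):

data(simu_data)
head(simu_data)
table(simu_data$DB) # database identifier: A overlayed on top of B
# Y (Yb1 here) is only observed in A, and Z (Yb2 here) only in B:
tapply(is.na(simu_data$Yb1), simu_data$DB, mean)
tapply(is.na(simu_data$Yb2), simu_data$DB, mean)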
Remarks about the target variables:

A target variable can be of categorical type, but also discrete, stored in factor form, ordered or not. Nevertheless, notice that if the variable is stored in numeric form it will be automatically converted in ordered factors.

If a target variable is incomplete, the corresponding rows will be automatically dropped during the execution of the function.

The type of each variable (including Y and Z) of the database must be rigorously specified, in one of the three arguments nominal, ordinal and logic.
D. TRANSFORMATIONS OF CONTINUOUS COVARIATES

Continuous shared variables (predictors) with infinite numbers of values have to be categorized before being introduced in the function. To assist users in this task, the function OT_joint integrates in its syntax a process dedicated to the categorization of continuous covariates. For this, it is necessary to rigorously fill in the arguments convert.num and convert.class. The first one informs about the column indexes of the continuous variables to be transformed in ordered factors while the second one specifies the corresponding number of desired balanced levels (for unbalanced levels, users must do the transformations by themselves). Therefore convert.num and convert.class must be vectors of the same length, but if the length of convert.num exceeds 1 while the length of convert.class is 1, then, by default, all the covariates to convert will have the same number of classes (transformation by quantiles), which corresponds to the value specified in the argument convert.class. Notice that only covariates can be transformed (not target variables) and that any incomplete information must have been taken into account beforehand (via the dedicated functions merge_dbs or imput_cov for example). Finally, it is recommended to declare all discrete covariates as ordinal factors using the argument ordinal. A sketch of the underlying quantile-based binning follows.
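For intuition, the following base R sketch reproduces what a balanced, quantile-based categorization of a continuous covariate looks like (an illustrative sketch only; the exact cut points computed internally by OT_joint may differ):

data(simu_data)
# Tertile breakpoints of the continuous covariate Age
tert <- quantile(simu_data$Age, probs = seq(0, 1, length.out = 4), na.rm = TRUE)
age3 <- cut(simu_data$Age, breaks = tert, include.lowest = TRUE, ordered_result = TRUE)
table(age3) # three roughly balanced ordered classes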
E. INFORMATION ABOUT DISTANCE FUNCTIONS AND RELATED PARAMETERS

Each individual (or row) of a given database is here characterized by a vector of covariates, so the distance between two individuals or two groups of individuals depends on similarities between covariates according to the distance function chosen by the user (via the argument dist.choice). Actually, four distance functions are implemented in OT_joint to take into account the most frequently encountered situations (see (3)):

the Manhattan distance ("M")

the Euclidean distance ("E")

the Gower distance for mixed data (see (4): "G")

the Hamming distance for binary data ("H")

Finally, two profiles of covariates P1 (n1 individuals) and P2 (n2 individuals) will be considered as neighbors if dist(P1, P2) < prox.X * max(dist(Pi, Pj)), where the ratio prox.X must be fixed by the user (0 <= prox.X <= 1). This choice is used in the computation of the JOINT and R_JOINT algorithms. The prox.X argument strongly influences the running time of the algorithm: the closer its value is to 1, the more difficult, or even impossible, the convergence of the algorithm will be. Each individual from A or B is here considered as a neighbor of only one profile of covariates.
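To make this trade-off concrete, the sketch below times two runs that only differ by prox.X (the values are arbitrary, and simu_data3 is built as in the examples further down):

data(simu_data)
simu_data2 <- simu_data[c(1:100, 401:500), c(1:4, 7:8)]
simu_data3 <- simu_data2[!is.na(simu_data2$Age), ]

# A small neighborhood threshold (fast, few neighbors per profile) ...
t_small <- system.time(
  OT_joint(simu_data3, prox.X = 0.10, convert.num = 6, convert.class = 3,
    nominal = c(1, 4:5), ordinal = 2:3, dist.choice = "H", which.DB = "B")
)
# ... versus a wider one (slower; values close to 1 may prevent convergence)
t_wide <- system.time(
  OT_joint(simu_data3, prox.X = 0.30, convert.num = 6, convert.class = 3,
    nominal = c(1, 4:5), ordinal = 2:3, dist.choice = "H", which.DB = "B")
)
rbind(t_small, t_wide)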
F. INFORMATION ABOUT THE SOLVER

The argument solvR permits the user to choose the solver of the optimization algorithm. The default solver is "glpk", which corresponds to the GNU Linear Programming Kit (see (5) for more details). Moreover, the function actually uses the R optimization infrastructure of the package ROI, which offers a wide choice of solvers by easily loading the associated ROI plugins (see (6)). For more details about the algorithms integrated in OT_joint, please consult (2).
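Since the optimization is delegated to ROI, the solvers usable through solvR can be listed once the corresponding plugin packages are installed (a minimal sketch, assuming the ROI package is available):

library(ROI)
ROI_registered_solvers() # solvers currently registered with ROI, hence usable by solvR
# e.g., with the plugin ROI.plugin.glpk installed, OT_joint(..., solvR = "glpk")
# relies on the GNU Linear Programming Kit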
A "otres" class object of 9 elements:
time_exe |
running time of the function |
gamma_A |
estimate of |
gamma_B |
estimate of |
profile |
a data.frame that gives all details about the remaining P profiles of covariates. These informations can be linked to the |
res_prox |
a |
estimatorZA |
an array that corresponds to estimates of the probability distribution of |
estimatorYB |
an array that corresponds to estimates of the probability distribution of |
DATA1_OT |
the database A with the individual predictions of |
DATA2_OT |
the database B with the individual predictions of |
Gregory Guernec, Valerie Gares, Jeremy Omer
Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106
Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615
Anderberg, M.R. (1973). Cluster analysis for applications, 359 pp., Academic Press, New York, NY, USA.
Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 623–637.
Makhorin A (2011). GNU Linear Programming Kit Reference Manual Version 4.47. http://www.gnu.org/software/glpk/
Theussl S, Schwendinger F, Hornik K (2020). ROI: An Extensible R Optimization Infrastructure. Journal of Statistical Software, 94(15), 1-64. doi:10.18637/jss.v094.i15
merge_dbs, OT_outcome, proxim_dist, avg_dist_closest
### An example of JOINT algorithm with:
#-----
# - A sample of the database tab_test
# - Y1 and Y2 are two outcomes encoded in 2 different forms in DB 1 and 2:
#   4 levels for Y1 and 3 levels for Y2
# - n1 = n2 = 40
# - 2 discrete covariates X1 and X2 defined as ordinal
# - Distances estimated using the Gower function
# Predictions are assessed for Y1 in B only
#-----
data(tab_test)
tab_test2 <- tab_test[c(1:40, 5001:5040), 1:5]
OUTJ1_B <- OT_joint(tab_test2,
  nominal = c(1, 4:5), ordinal = c(2, 3),
  dist.choice = "G", which.DB = "B"
)

### An example of R-JOINT algorithm using the previous database,
### and keeping the same options except for:
#-----
# - Inclusion of an error term in the constraints on
#   the marginals (relaxation term)
# Predictions are assessed for Y1 AND Y2 in A and B respectively
#-----
R_OUTJ1 <- OT_joint(tab_test2,
  nominal = c(1, 4:5), ordinal = c(2, 3),
  dist.choice = "G", maxrelax = 0.4, which.DB = "BOTH"
)

### The previous example of R-JOINT algorithm with:
# - Adding a regularization term
# Predictions are assessed for Y1 and Y2 in A and B respectively
#-----
R_OUTJ2 <- OT_joint(tab_test2,
  nominal = c(1, 4:5), ordinal = c(2, 3),
  dist.choice = "G", maxrelax = 0.4, lambda.reg = 0.9, which.DB = "BOTH"
)

### Another example of JOINT algorithm with:
#-----
# - A sample of the database simu_data
# - Y and Z are two outcomes encoded in 2 different forms in DB A and B:
#   (3 levels for Y and 5 levels for Z)
# - n1 = n2 = 100
# - 3 covariates: Gender, Smoking and Age in a qualitative form
# - Complete case study
# - The Hamming distance
# Predictions are assessed for Y in B only
#-----
data(simu_data)
simu_data2 <- simu_data[c(1:100, 401:500), c(1:4, 7:8)]
simu_data3 <- simu_data2[!is.na(simu_data2$Age), ]
OUTJ2 <- OT_joint(simu_data3,
  prox.X = 0.10,
  convert.num = 6, convert.class = 3,
  nominal = c(1, 4:5), ordinal = 2:3,
  dist.choice = "H", which.DB = "B"
)
The function OT_outcome integrates two algorithms called OUTCOME and R-OUTCOME dedicated to solving recoding problems in data fusion, using optimal transportation (OT) of the joint distribution of outcomes.
OT_outcome( datab, index_DB_Y_Z = 1:3, quanti = NULL, nominal = NULL, ordinal = NULL, logic = NULL, convert.num = NULL, convert.class = NULL, FAMD.coord = "NO", FAMD.perc = 0.8, dist.choice = "E", percent.knn = 1, maxrelax = 0, indiv.method = "sequential", prox.dist = 0, solvR = "glpk", which.DB = "BOTH" )
datab | a data.frame made up of two overlayed databases with at least four columns sorted in a random order. One column must be dedicated to the identification of the two databases, ranked in ascending order (for example: 1 for the top database and 2 for the database from below, or more logically here A and B, but not B and A!). One column (Y here, but other names are allowed) must correspond to the target variable related to the information of interest to merge, with its specific encoding in database A (corresponding encoding missing in database B). In the same way, one column (Z here) corresponds to the second target variable with its specific encoding in database B (missing encoding in A). Finally, the input database must have at least one shared covariate with the same encoding in A and B.
index_DB_Y_Z | a vector of three indexes of variables. The first index must correspond to the index of the databases identifier column. The second index corresponds to the index of the target variable in the first database (A), while the third index corresponds to the column index related to the target variable in the second database (B).
quanti | a vector of column indexes of all the quantitative variables (database identifier and target variables included, if it is the case for them).
nominal | a vector of column indexes of all the nominal (not ordered) variables (database identifier and target variables included, if it is the case for them).
ordinal | a vector of column indexes of all the ordinal variables (database identifier and target variables included, if it is the case for them).
logic | a vector of column indexes of all the boolean variables of the data.frame.
convert.num | indexes of the continuous (quantitative) variables to convert in ordered factors if necessary. All indexes declared in this argument must also have been declared in the argument quanti.
convert.class | a vector indicating, for each continuous variable to convert, the corresponding desired number of levels. If the length of the argument convert.num exceeds 1 while the length of convert.class equals 1, each discretization will count the same number of levels (quantiles).
FAMD.coord | a logical that must be set to TRUE when the user decides to work with principal components of a factor analysis for mixed data (FAMD) instead of the set of raw covariates (FALSE is the default value).
FAMD.perc | a percent (between 0 and 1) linked to the FAMD.coord argument: it fixes the minimal part of variability that must be explained by the principal components of the FAMD kept for the computations (0.8 by default).
dist.choice | a character string (with quotes) corresponding to the distance function chosen between: the Euclidean distance ("E", by default), the Manhattan distance ("M"), the Gower distance ("G"), the Hamming distance ("H") for binary covariates only, and the Euclidean or Manhattan distance computed from principal components of a factor analysis of mixed data ("FAMD"). See (1) for details.
percent.knn | the ratio of closest neighbors involved in the computations of the cost matrices. 1 is the default value that includes all rows in the computation.
maxrelax | the maximum percentage of deviation from expected probability masses. It must be equal to 0 (default value) for the OUTCOME algorithm, and strictly positive for the R-OUTCOME algorithm.
indiv.method | a character string indicating the chosen method to get individual predictions from the joint probabilities assessed, "sequential" by default, or "optimal". See the details section for more information.
prox.dist | a probability (between 0 and 1) used to calculate the distance threshold below which an individual (a row) is considered as a neighbor of a given profile of covariates. When shared variables are all factors or categorical, it is suggested to keep this option to 0.
solvR | a character string that specifies the type of method selected to solve the optimization algorithms. The default solver is "glpk".
which.DB | a character string indicating the database to complete ("BOTH" by default, for the prediction of Y and Z in the two databases; "A" for the prediction of Z in database A only; "B" for the prediction of Y in database B only).
A. THE RECODING PROBLEM IN DATA FUSION

Assuming that Y and Z are two target variables which refer to the same target population in two separate databases A and B respectively (no overlapping rows), so that Y and Z are never jointly observed. Assuming also that A and B share a subset of common covariates X of any types (same encodings in A and B), complete or not. Merging these two databases often requires solving a recoding problem by creating a unique database where the missing information of Y and Z is fully completed.
B. INFORMATION ABOUT THE ALGORITHM

The algorithm integrated in the function OT_outcome provides a solution to the recoding problem previously described by proposing an application of optimal transportation which aims to search for a bijective mapping between the distributions of Y in A and Z in B. Mathematically, the principle of the algorithm is based on the resolution of an optimization problem which provides an optimal solution γ (as called in the related articles) that transfers the distribution of Y in A to the distribution of Z in B (or conversely, according to the sense of the transport) and can thus be interpreted as an estimator of the joint distribution (Y,Z) in A (or B, respectively). According to this result, a second step of the algorithm provides individual predictions of Y in B (respectively of Z in A, or both, depending on the choice specified by the user in the argument which.DB). Two possible approaches are available depending on the argument indiv.method:

When indiv.method = "sequential", a nearest neighbor procedure is applied. This corresponds to the use of the function indiv_grp_closest implemented in the function OT_outcome.

When indiv.method = "optimal", a linear optimization problem is solved to determine the individual predictions that minimize the sum of the individual distances in A (respectively in B) with the modalities of Z in B (respectively Y in A). This approach is applied via the function indiv_grp_optimal implemented in the function OT_outcome.

A comparison sketch of the two procedures is given at the end of this section.

This algorithm supposes the respect of the two following assumptions:

Y must follow the same distribution in A and B. In the same way, Z follows the same distribution in the two databases.

The conditional distribution (Y|X) must be identical in A and B. Respectively, (Z|X) is supposed identical in A and B.

Because the first assumption can be too strong in some situations, a relaxation of the constraints on the marginal distributions is possible using the argument maxrelax. When indiv.method = "sequential" and maxrelax = 0, the algorithm called OUTCOME (see (1) and (2)) is applied. In all other situations, the applied algorithm corresponds to an algorithm called R_OUTCOME (see (2)). A posteriori estimates of the conditional probabilities P[Y|X,Z] and P[Z|X,Y] are available for each profile of covariates (see the output objects estimatorYB and estimatorZA). Estimates of γ are also available according to the desired direction of the transport (from A to B and/or conversely; see the outputs gamma_A and gamma_B).
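The two procedures can be contrasted directly on the same input, as in this sketch (simu_dat is built as in the examples further down; only the argument indiv.method changes):

data(simu_data)
simu_dat <- simu_data[c(1:200, 301:500), ]

seq_fit <- OT_outcome(simu_dat,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  dist.choice = "M", indiv.method = "sequential", which.DB = "B"
)
opt_fit <- OT_outcome(simu_dat,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  dist.choice = "M", indiv.method = "optimal", which.DB = "B"
)

# Both objects complete the same database B; their individual predictions of Y
# can then be compared row by row
head(seq_fit$DATA2_OT)
head(opt_fit$DATA2_OT)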
C. EXPECTED STRUCTURE FOR THE INPUT DATABASE

The input database is a data.frame that must be saved in a specific form by users:

Two overlayed databases containing a common column of database identifiers (A and B, 1 or 2, for example, encoded in numeric or factor form)

A column corresponding to the target variable with its specific encoding in A (for example a factor encoded in nY levels, ordered or not, with NAs in the corresponding rows of B)

A column corresponding to the second target outcome with its specific encoding in B (for example a factor in nZ levels, with NAs in the rows of A)

The order of the variables in the database has no importance but the column indexes related to the three columns previously described (i.e. ID, Y and Z) must be rigorously specified in the argument index_DB_Y_Z.

A set of shared covariates (at least one, but more is recommended) of any type, complete or not (provided that the number of covariates exceeds 1), is required. The function merge_dbs is available in this package to assist users in the preparation of their databases, so please do not hesitate to use it beforehand if necessary.
Remarks about the target variables:

A target variable can be of categorical type, but also discrete, stored in factor form, ordered or not. Nevertheless, notice that if the variable is stored in numeric form it will be automatically converted in ordered factors.

If a target outcome is incomplete, the corresponding rows will be automatically dropped during the execution of the function.

The type of each variable (including Y and Z) of the database must be rigorously specified once, in one of the four arguments quanti, nominal, ordinal and logic.
D. TRANSFORMATIONS OF CONTINUOUS COVARIATES

The function OT_outcome integrates in its syntax a process dedicated to the categorization of continuous covariates. For this, it is necessary to rigorously fill in the arguments convert.num and convert.class. The first one informs about the indexes in the database of the continuous variables to transform in ordered factors while the second one specifies the corresponding number of desired balanced levels (for unbalanced levels, users must do the transformations by themselves). Therefore convert.num and convert.class must be vectors of the same length, but if the length of convert.num exceeds 1 while the length of convert.class is 1, then, by default, all the covariates to convert will have the same number of classes, which corresponds to the value specified in the argument convert.class. Please notice that only covariates can be transformed (not outcomes) and that missing informations are not taken into account for the transformations. Moreover, all the indexes informed in the argument convert.num must also be informed in the argument quanti (a minimal guard is sketched below).
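This last requirement can be made explicit before calling the function (a sketch with hypothetical index vectors):

quanti <- c(3, 8)  # indexes of all the quantitative variables
convert.num <- 8   # indexes of the variables to categorize
stopifnot(all(convert.num %in% quanti)) # required by OT_outcome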
E. INFORMATION ABOUT DISTANCE FUNCTIONS

Each individual (or row) of a given database is here characterized by a vector of covariates, so the distance between two individuals or two groups of individuals depends on similarities between covariates according to the distance function chosen by the user (via the argument dist.choice). Actually, four distance functions are implemented in OT_outcome to take into account the most frequently encountered situations (see (3)):

the Manhattan distance ("M")

the Euclidean distance ("E")

the Gower distance for mixed data (see (4): "G")

the Hamming distance for binary data ("H")

Moreover, it is also possible to directly apply the first three distances mentioned above on coordinates extracted from a multivariate analysis (Factor Analysis for Mixed Data, see (5)) applied on the raw covariates, using the arguments FAMD.coord and FAMD.perc. This method is used in (1).

As a decision rule, for a given profile of covariates P_j, an individual i will be considered as a neighbor of P_j if dist(i, P_j) < prox.dist * max(dist(i, P_j)), where prox.dist must be fixed by the user.
F. INFORMATION ABOUT THE SOLVER

The argument solvR permits the user to choose the solver of the optimization algorithm. The default solver is "glpk", which corresponds to the GNU Linear Programming Kit (see (6) for more details). Moreover, the function actually uses the R optimization infrastructure of the package ROI, which offers a wide choice of solvers by easily loading the associated ROI plugins (see (7)). For more details about the algorithms integrated in OT_outcome, please consult (1) and (2).
A "otres" class object of 9 elements:
time_exe |
the running time of the function |
gamma_A |
a matrix corresponding to an estimation of the joint distribution of |
gamma_B |
a matrix corresponding to an estimation of the joint distribution of |
profile |
a data.frame that gives all details about the remaining |
res_prox |
the outputs of the function |
estimatorZA |
an array that corresponds to estimates of the probability distribution of |
estimatorYB |
an array that corresponds to estimates of the probability distribution of |
DATA1_OT |
the database A with the individual predictions of |
DATA2_OT |
the database B with the individual predictions of |
Gregory Guernec, Valerie Gares, Jeremy Omer
Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106
Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615
Anderberg, M.R. (1973). Cluster analysis for applications, 359 pp., Academic Press, New York, NY, USA.
Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 623–637.
Pages J. (2004). Analyse factorielle de donnees mixtes. Revue Statistique Appliquee. LII (4). pp. 93-111.
Makhorin A (2011). GNU Linear Programming Kit Reference Manual Version 4.47. http://www.gnu.org/software/glpk/
Theussl S, Schwendinger F, Hornik K (2020). ROI: An Extensible R Optimization Infrastructure. Journal of Statistical Software, 94(15), 1-64. doi:10.18637/jss.v094.i15
transfo_dist, proxim_dist, avg_dist_closest, indiv_grp_closest, indiv_grp_optimal
### Using a sample of the simu_data dataset
### Y and Z are the same variable encoded in 2 different forms:
### (3 levels for Y and 5 levels for Z)
#--------
data(simu_data)
simu_dat <- simu_data[c(1:200, 301:500), ]

### An example of OUTCOME algorithm that uses:
#-----
# - A nearest neighbor procedure for the estimation of individual predictions
# - The Manhattan distance function
# - 90% of individuals from each modality to calculate average distances
#   between individuals and modalities
# Predictions are assessed for Y in B and Z in A
#-----
OUTC1 <- OT_outcome(simu_dat,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  dist.choice = "M", maxrelax = 0, percent.knn = 0.90,
  indiv.method = "sequential"
)
head(OUTC1$DATA1_OT) # Part of the completed database A
head(OUTC1$DATA2_OT) # Part of the completed database B
head(OUTC1$estimatorZA[, , 1])
# ... Corresponds to P[Z = 1|Y,P1] where P1 corresponds to the 1st profile of
# covariates (P_1) detailed in the 1st row of the profile object:
OUTC1$profile[1, ] # Details of P_1
# So estimatorZA[1,1,1] = 0.2 corresponds to an estimation of:
# P[Z = 1|Y=[20-40],Gender_2=0,Treatment_2=1,Treatment_3=0,Smoking_2=1,Dosage=3,Age=65.44]
# Thus, we can conclude that all individuals with the P_1 profile of covariates
# have a 20% chance of being assigned to the 1st level of Z in database A.
# ... And so on; the reasoning is the same for the estimatorYB object.

### An example of OUTCOME algorithm with the same conditions as the previous
### example, except that:
# - Only the individual predictions of Y in B are required
# - The continuous covariate "age" (related index = 8) is converted in an
#   ordinal factor of 3 balanced classes (tertiles)
# - The Gower distance is now used
### -----
OUTC2_B <- OT_outcome(simu_dat,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  dist.choice = "G", maxrelax = 0,
  convert.num = 8, convert.class = 3,
  indiv.method = "sequential", which.DB = "B"
)

### An example of OUTCOME algorithm with the same conditions as the first
### example, except that:
# - Only the individual predictions of Y in B are required
# - The continuous covariate "age" (related index = 8) is converted in an
#   ordinal factor of 3 balanced classes (tertiles)
# - Here, the Hamming distance can be applied because, after conversion, all
#   covariates are factors. Disjunctive tables of each covariate will be
#   automatically used to work with a set of binary variables.
### -----
OUTC3_B <- OT_outcome(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  dist.choice = "H", maxrelax = 0,
  convert.num = 8, convert.class = 3,
  indiv.method = "sequential", which.DB = "B"
)

### An example of R-OUTCOME algorithm using:
# - An optimization procedure for individual predictions on the 2 databases
# - The Manhattan distance
# - Raw covariates
### -----
R_OUTC1 <- OT_outcome(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  dist.choice = "M", maxrelax = 0, indiv.method = "optimal"
)

### An example of R-OUTCOME algorithm with:
# - An optimization procedure for individual predictions on the 2 databases
# - The use of the Euclidean distance on coordinates from FAMD
### -----
R_OUTC2 <- OT_outcome(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  dist.choice = "E", FAMD.coord = "YES", FAMD.perc = 0.8,
  indiv.method = "optimal"
)

### An example of R-OUTCOME algorithm with relaxation on marginal distributions and:
# - An optimization procedure for individual predictions on the 2 databases
# - The use of the Euclidean distance
# - An arbitrary coefficient of relaxation
# - Raw covariates
#-----
R_OUTC3 <- OT_outcome(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  dist.choice = "E", maxrelax = 0.4, indiv.method = "optimal"
)
A function that gives the power set of any non-empty set S.
power_set(n, ordinal = FALSE)
n | an integer. The cardinal of the set S
ordinal | a boolean. If TRUE the power set is only composed of subsets of consecutive elements, FALSE (by default) otherwise.
A list of subsets (the empty set is excluded)
Gregory Guernec
Devlin, Keith J (1979). Fundamentals of contemporary set theory. Universitext. Springer-Verlag
# Power set of a set of 4 elements
set1 <- power_set(4)

# Power set of a set of 4 elements, keeping only
# the subsets of consecutive elements
set2 <- power_set(4, ordinal = TRUE)
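For a smaller set, the expected content of the output can be written down exhaustively (the order of the subsets in the list may differ):

power_set(3)
# 7 subsets: {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}
power_set(3, ordinal = TRUE)
# 6 subsets of consecutive elements: {1}, {2}, {3}, {1,2}, {2,3}, {1,2,3}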
proxim_dist computes the pairwise distance matrix of a database and the cross-distance matrix between two databases according to various distances used in the context of data fusion.
proxim_dist(data_file, indx_DB_Y_Z = 1:3, norm = "E", prox = 0.3)
data_file | a data.frame corresponding ideally to an output object of the function transfo_dist. Otherwise, a data.frame made up of two overlayed databases with a column of database identifiers, two columns dedicated to the target variables of the two databases, and a set of shared covariates.
indx_DB_Y_Z | a vector of three column indexes corresponding to the database identifier, the target variable of the above database and the target variable of the below database. The indexes must be declared in this specific order.
norm | a character string indicating the choice of the distance function. This latter depends on the type of the common covariates: the Hamming distance for binary covariates only ("H"), the Manhattan distance ("M"), the Euclidean distance ("E", by default), or the Gower distance for mixed covariates ("G").
prox | a ratio (between 0 and 1) used to calculate the distance threshold below which an individual (a row or a given statistical unit) is considered as a neighbor of a given profile of covariates. 0.3 is the default value.
This function is the first step of a family of algorithms that solve recoding problems of data fusion using optimal transportation theory (see the details of the corresponding models OUTCOME, R_OUTCOME, JOINT and R_JOINT in (1) and (2)). The function proxim_dist is directly implemented in the functions OT_outcome and OT_joint but can also be used separately, as long as the input database has a suitable structure. Nevertheless, its preparation will have to be rigorously made in the two steps detailed in the following sections.
A. EXPECTED STRUCTURE FOR THE INPUT DATABASE

Firstly, the initial database required is a data.frame that must be prepared in a specific form by users. From two separate databases, the function merge_dbs available in this package can assist users in this initial merging; nevertheless, notice that this preliminary transformation can also be made directly by following the imposed structure described below: two overlayed databases containing a common column of database identifiers (A and B for example, encoded in numeric or factor form), a column corresponding to the target variable with its specific encoding in A (for example a factor encoded in nY levels, ordered or not, with NAs in the corresponding rows of B), a column corresponding to the same variable with its specific encoding in B (for example a factor in nZ levels, with NAs in database A), and a set of shared covariates (at least one) between the two databases.

The order of these variables in the database has no importance but the column indexes related to the database identifier, Y and Z must be specified in the indx_DB_Y_Z option. Users can refer to the structure of the table simu_data available in the package to adapt their databases to the initial format required.

Missing values are allowed on covariates only, and are excluded from all computations involving the rows within which they occur. In the particular case where only one covariate with NAs is used, we recommend working with imputed or complete cases only, to avoid the presence of NAs in the distance matrix that will be computed a posteriori. If the database counts many covariates and some of them have missing data, users can keep them or apply beforehand the imput_cov function on the data.frame to deal with this problem.
B. DISTANCE FUNCTIONS AND TYPES OF COVARIATES

In a second step, the shared variables of the merged database will have to be encoded according to the choice of the distance function fixed by the user, knowing that it is also frequent that the type of the variables fixes the distance function to choose. The function transfo_dist is available in the package to assist users in this task, but users can also decide to make this preparation by themselves. Thus, with the Euclidean or Manhattan distance ((3), norm = "E" or "M"), all types of variables are allowed: logical variables are transformed in binary variables, and categorical variables (factors, ordered or not) are replaced by their related disjunctive tables (the function transfo_quali can make these specific transformations). The Hamming distance (norm = "H") only requires binary variables (all other forms are not allowed). In this context, continuous variables could have been converted in factors of k levels (k >= 2) beforehand, and the categorical covariates are then transformed in disjunctive tables (containing the corresponding binary variables) before use. Notice that using the Hamming distance could be quite long in presence of NAs on covariates. Finally, the Gower distance ((4), norm = "G") uses the gower.dist function (5) and so allows logical, categorical and numeric variables without preliminary transformations.

In conclusion, the structure of the data.frame required in input of the function proxim_dist corresponds to two overlayed databases with two target outcomes and a set of shared covariates whose encodings depend on the distance function chosen by the user.

If some columns are excluded when computing an Euclidean, Manhattan, or Hamming distance between two rows, the sum is scaled up proportionally to the number of columns used in the computation, as proposed by the standard dist function. If all pairs are excluded when computing a particular distance, instead of putting NA in the corresponding cell of the distance matrix, the process stops and an object listing the problematic rows is returned in output: it suggests users to remove these rows before running the process again, or to impute the NAs related to these rows (see (6) for more details).
C. PROFILES OF COVARIATES AND OUTPUT DETAILS

Whatever the type (mixed or not) and the number of covariates in the data.frame of interest, the function proxim_dist firstly detects all the possible profiles (or combinations) of covariates from the two databases, and saves them in the output profile. For example, assuming that a data.frame in input (composed of two overlayed data.frames A and B) has three shared binary covariates (identically encoded in A and B), the sequences 011 and 101 will be considered as two distinct profiles of covariates. If each covariate is a factor of n1, n2 and n3 levels respectively, there exist at most n1 * n2 * n3 possible profiles of covariates. This number is considered as a maximum here because only the profiles of covariates met in at least one of the two databases will be kept for the study (this counting is sketched below).
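The counting argument can be checked directly in base R, as in the following sketch (assuming the three covariates of simu_data are stored as factors):

data(simu_data)
covs <- simu_data[, c("Gender", "Smoking", "Dosage")]
prod(sapply(covs, nlevels)) # theoretical maximum number of profiles
nrow(unique(na.omit(covs))) # profiles actually met in the two databases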
proxim_dist classifies individuals from the two databases according to their proximities to each profile of covariates, and saves the corresponding row indexes from A and B in two lists, indXA and indXB respectively. indXA and indXB thus contain as many objects as profiles of covariates, and the proximity between a given profile and a given individual is defined as follows. The function also provides in output the list of all the encountered profiles of covariates.

As a decision rule, for a given profile of covariates P_j, an individual i will be considered as a neighbor of P_j if dist(i, P_j) < prox * max(dist(i, P_j)), where prox must be fixed by the user. Setting the prox parameter to 0 ensures that each individual of A (and of B, respectively) is the neighbor of exactly one profile of covariates; it is therefore not recommended in presence of continuous covariates. Conversely, assigning the value 1 to prox is not recommended either, because it assumes that each individual is a neighbor of all the encountered profiles of covariates.
A list of 16 elements (the first 16 detailed below) is returned, containing various distance matrices and lists useful for the algorithms that use optimal transportation theory. Two more objects (the last two of the following list) will be returned if the distance matrices contain NAs.
FILE_NAME | a simple reminder of the name of the raw database
nA | the number of rows of the first database (A)
nB | the number of rows of the second database (B)
Xobserv | the subset of the two overlayed databases composed of the shared variables only
profile | the different encountered profiles of covariates according to the data.frame
Yobserv | the numeric values of the target variable in the first database
Zobserv | the numeric values of the target variable in the second database
D | a distance matrix corresponding to the computed distances between individuals of the two databases
Y | the nY levels of the target variable in numeric form, in the first database
Z | the nZ levels of the target variable in numeric form, in the second database
indY | a list of nY groups of row numbers, where each group gathers the rows of the first database related to a given level of Y
indZ | a list of nZ groups of row numbers, where each group gathers the rows of the second database related to a given level of Z
indXA | a list of individual (row) indexes from the first database, sorted by profiles of covariates according to their proximities. See the Details part (C) for more information
indXB | a list of individual (row) indexes from the second database, sorted by profiles of covariates according to their proximities. See the Details part (C) for more information
DA | a distance matrix corresponding to the pairwise distances between individuals of the first database
DB | a distance matrix corresponding to the pairwise distances between individuals of the second database
ROWS_TABLE | combinations of row numbers of the two databases that generate NAs in D
ROWS_TO_RM | number of times a row of the first or second database is involved in the NA process of D
Gregory Guernec, Valerie Gares, Jeremy Omer
Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106
Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615
Anderberg, M.R. (1973). Cluster analysis for applications, 359 pp., Academic Press, New York, NY, USA.
Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 623–637.
D'Orazio M. (2015). Integration and imputation of survey data in R: the StatMatch package. Romanian Statistical Review, vol. 63(2)
Borg, I. and Groenen, P. (1997). Modern Multidimensional Scaling. Theory and Applications. Springer.
transfo_dist, imput_cov, merge_dbs, simu_data
data(simu_data)

### The covariates of the data are prepared according to the chosen distance
### using the transfo_dist function

### Ex 1: The Manhattan distance
man1 <- transfo_dist(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  logic = NULL, prep_choice = "M"
)
mat_man1 <- proxim_dist(man1, norm = "M") # man1 compatible with norm = "E" for Euclidean

### Ex 2: The Euclidean and Manhattan distances applied on coordinates from FAMD
eucl_famd <- transfo_dist(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  logic = NULL, prep_choice = "FAMD", info = 0.80
)
mat_e_famd <- proxim_dist(eucl_famd, norm = "E")
mat_m_famd <- proxim_dist(eucl_famd, norm = "M")

### Ex 3: The Gower distance with mixed covariates
gow1 <- transfo_dist(simu_data[c(1:100, 301:400), ],
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  logic = NULL, prep_choice = "G"
)
mat_gow1 <- proxim_dist(gow1, norm = "G")

### Ex 4a: The Hamming distance with binary (but incomplete) covariates only
# categorization of the continuous covariate age by tertiles
ham1 <- transfo_dist(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  convert_num = 8, convert_class = 3, prep_choice = "H"
)
mat_ham1 <- proxim_dist(ham1, norm = "H") # Be patient ... It could take a few minutes

### Ex 4b: The Hamming distance with complete cases on nominal and ordinal covariates only
simu_data_CC <- simu_data[(!is.na(simu_data[, 5])) & (!is.na(simu_data[, 6])) &
  (!is.na(simu_data[, 7])), 1:7]
ham2 <- transfo_dist(simu_data_CC,
  quanti = 3, nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  prep_choice = "H"
)
mat_ham2 <- proxim_dist(ham2, norm = "H")

### Ex 5: PARTICULAR CASE: only one covariate with no NAs
man2 <- man1[, c(1:3, 7)] # Only the Smoking variable
man2_nona <- man2[!is.na(man2[, 4]), ] # Keep complete cases
mat_man2_nona <- proxim_dist(man2_nona, norm = "M", prox = 0.10)
mat_man2_nona_H <- proxim_dist(man2_nona, norm = "H") # Hamming

### Ex 6: PARTICULAR CASE: many covariates but NAs in the distance matrix
# We generate NAs in the man1 object so that dist(A4,B102) and dist(A122,B102)
# return NA whatever the norm chosen:
man1b <- man1
man1b[4, 7:9] <- NA
man1b[122, 6:9] <- NA
man1b[300 + 102, 4:6] <- NA
mat_man3 <- proxim_dist(man1b, norm = "M")
# The stopped process indicates 2 NAs and the corresponding row numbers.
# The 2nd output of mat_man3 indicates that removing the 102th row of database B
# first is enough to solve the problem:
man1c <- man1b[-402, ]
mat_man4 <- proxim_dist(man1c, norm = "M")
Selection of a subset of non-collinear predictors having relevant relationships with a given target outcome, using a random forest procedure.
select_pred( databa, Y = NULL, Z = NULL, ID = 1, OUT = "Y", quanti = NULL, nominal = NULL, ordinal = NULL, logic = NULL, convert_num = NULL, convert_class = NULL, thresh_cat = 0.3, thresh_num = 0.7, thresh_Y = 0.2, RF = TRUE, RF_ntree = 500, RF_condi = FALSE, RF_condi_thr = 0.2, RF_SEED = sample(1:1e+06, 1) )
databa |
a data.frame with a column of identifiers (of row or of database, in the case of two concatenated databases), an outcome, and a set of predictors. The number of columns can exceed the number of rows. |
Y |
the label of a first target variable with quotes |
Z |
the label of a second target variable with quotes when the data.frame is the result of two overlayed databases (NULL otherwise) |
ID |
the column index of the database identifier (the first column by default) in the case of two concatenated databases, a row identifier otherwise |
OUT |
a character that indicates the outcome to predict in the context of overlayed databases. By default, the outcome declared in the argument `Y` is predicted ("Y"); set "Z" to predict the outcome declared in the argument `Z` |
quanti |
a vector of integers corresponding to the column indexes of all the numeric predictors. |
nominal |
a vector of integers which corresponds to the column indexes of all the categorical nominal predictors. |
ordinal |
a vector of integers which corresponds to the column indexes of all the categorical ordinal predictors. |
logic |
a vector of integers indicating the indexes of logical predictors. NULL by default (no logical predictor) |
convert_num |
a vector of integers indicating the indexes of quantitative variables to convert into ordered factors. NULL by default (no conversion). Each index selected has to be declared as quantitative in the argument `quanti` |
convert_class |
a vector of integers indicating the number of classes related to each transformation of a quantitative variable into an ordered factor. The length of this vector cannot exceed the length of the argument `convert_num` |
thresh_cat |
a threshold associated with the Cramer's V coefficient (= 0.30 by default) |
thresh_num |
a threshold associated with Spearman's correlation coefficient (= 0.70 by default) |
thresh_Y |
a threshold linked to the RF approach, corresponding to the minimal cumulative percent of importance measure required for a variable to be kept in the final list of predictors |
RF |
a boolean set to TRUE (default) if a random forest procedure must be applied to select the best subset of predictors according to the outcome. Otherwise, only pairwise associations between predictors are used for the selection. |
RF_ntree |
the number of bootstrap samples required from the raw data source during the random forest procedure |
RF_condi |
a boolean specifying if the conditional importance measures must be assessed from the random forest procedure (FALSE by default) |
RF_condi_thr |
a threshold linked to (1 - pvalue) of an association test between each predictor and the other variables, used only when RF_condi = TRUE (0.20 by default) |
RF_SEED |
an integer used as argument by set.seed() to fix the random number generator (random integer by default). This value is only used by the RF method. |
The `select_pred`
function provides several tools to identify, on the one hand, the relationships between predictors, notably by detecting potential problems of collinearity, and, on the other hand, to propose a parsimonious subset of relevant predictors (of the outcome) using appropriate random forest procedures.
The function, which can be used as a preliminary step of prediction in regression settings, is particularly suited to the context of data fusion, as it provides relevant subsets of predictors (the matching variables) to the algorithms dedicated to the solving of recoding problems.
A. REQUIRED STRUCTURE FOR THE DATABASE
The expected input database is a data.frame that especially requires a specific column of row identifiers and a target variable (or outcome) having a finite number of values or classes (ordinal, nominal or discrete type). Notice that if the chosen outcome is in numeric form, it will be automatically converted to an ordinal type.
The number of predictors is not a constraint for select_pred
(even if, with fewer than three variables, a variable selection process makes little sense), and can exceed the number of rows (no problem of high dimensionality here).
The predictors can be continuous (quantitative), boolean, nominal or ordinal with or without missing values.
In the presence of numeric variables, users can discretize all or some of them by themselves beforehand, or use the internal process directly integrated in the function. Indeed, to assist users in this task, two arguments called `convert_num`
and convert_class
dedicated to these transformations are available in input of the function.
These options make the function select_pred
particularly adapted to the function OT_joint
which only allows data.frame with categorical covariates.
With the argument convert_num
, users choose the continuous variables to convert and the related argument convert_class
specifies the corresponding number of classes chosen for each discretization.
This is why these two arguments must be two vectors of indexes of the same length. Nevertheless, a unique exception exists when `convert_class`
is set to a scalar S: in this case, all the continuous predictors selected for conversion will be discretized with the same number S of classes.
For example, if
convert_class = 4
, all the continuous variables specified in the convert_num
argument will be discretized by quartiles. Moreover, notice that missing values from incomplete predictors to convert are not taken into account during the conversion, and that each predictor specified in the argument convert_num
must also be specified in the argument quanti
.
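As a hedged illustration of these two arguments, reusing the `simu_data` example documented below (a minimal sketch, not a recommended analysis), a scalar `convert_class` applied to the continuous covariate age gives:

```r
# minimal sketch: discretizing the continuous covariate age (column 8)
# of simu_data into quartiles before the selection step
data(simu_data)
sel_q <- select_pred(simu_data,
  Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Y",
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  convert_num = 8, convert_class = 4, # a scalar: quartiles for all selected indexes
  RF = FALSE
)
```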
When the input corresponds to a single database, the label of the outcome must be entered in the argument Y
, and the arguments Z
and OUT
must keep their default values.
Finally, the order of the column indexes related to the identifier and the outcome has no importance.
For a better flexibility, the input database can also be the result of two overlayed databases.
In this case, the structure of the database must be similar to those observed in the datasets simu_data
and tab_test
available in the package with a column of database identifier, one target outcome by database (2 columns), and a subset of shared predictors.
Notice that, overlaying two separate databases can also be done easily using the function merge_dbs
beforehand.
The labels of the two outcomes will have to be specified in the arguments Y
for the top database, and in Z
for the bottom one.
Notice also that the function select_pred
deals with only one outcome at a time that will have to be specified in the argument OUT
which must be set to "Y" for the study of the top database or "Z" for the study of the bottom one.
Finally, whatever the structure of the database declared in input, each column index related to the database variable must be entered once (and only once) in one of the following four arguments: quanti
, nominal
, ordinal
, logic
.
B. PAIRWISE ASSOCIATIONS BETWEEN PREDICTORS
In a first step of process, select_pred
calculates standard pairwise associations between predictors according to their types.
Between categorical predictors (ordinal, nominal and logical):
Cramer's V (and the bias-corrected Cramer's V, see (1) for more details) are calculated between categorical predictors, and the argument `thresh_cat`
fixes the associated threshold beyond which two predictors can be considered as redundant.
A similar process is done between the target variable and the subset of categorical variables, which provides in output a first table ranking the top-scoring predictors. This table summarizes the ability of each variable to predict the target outcome.
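For intuition, a standard (uncorrected) Cramer's V can be sketched in base R as below; `select_pred` reports the bias-corrected version of reference (1), which differs slightly, so this is illustrative only:

```r
# uncorrected Cramer's V between two categorical variables (illustrative;
# select_pred uses the bias-corrected version of Bergsma, 2013)
cramer_v <- function(x, y) {
  tab  <- table(x, y) # NAs dropped by default
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  k    <- min(nrow(tab), ncol(tab)) - 1
  as.numeric(sqrt(chi2 / (sum(tab) * k)))
}
data(simu_data)
cramer_v(simu_data$Gender, simu_data$Treatment) # compare with thresh_cat = 0.30
```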
Between continuous predictors:
If the ordinal
and logic
arguments differ from NULL, all the corresponding predictors are first converted to ranks.
For numeric (quantitative), logical and ordinal predictors, pairwise correlations between ranks (Spearman) are calculated, and the argument `thresh_num`
fixes the related threshold beyond which two predictors can be considered as redundant.
A similar process is done between the outcome and the subset of discrete variables, which provides in output a table ranking the top-scoring predictors and summarizing their ability to predict the target.
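In practice this screening amounts to thresholding a matrix of absolute rank correlations; a toy sketch (the data are invented and the 0.70 threshold mirrors `thresh_num`):

```r
# toy sketch of the rank-correlation screening behind thresh_num
num_part <- data.frame(age  = c(23, 35, 51, 44, 62),
                       dose = c(1, 2, 3, 2, 4),
                       bmi  = c(21, 24, 29, 26, 31))
rho <- cor(num_part, method = "spearman")
abs(rho) > 0.70 & upper.tri(rho) # TRUE cells flag potentially redundant pairs
```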
In addition, the result of a Farrar and Glauber test is provided. This test is based on the determinant of the correlation matrix of covariates and the related null hypothesis of the test corresponds to an absence of collinearity between them (see (2) for more details about the method).
In the presence of a large number of numeric covariates and/or ordered factors, the approximate Farrar-Glauber test, based on the normal approximation of the null distribution, is better suited, and its result is also provided in output.
These two tests are highly sensitive; consequently, it is suggested to consider their results as simple indicators of collinearity between predictors rather than as an essential condition of acceptability.
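The package's exact internal implementation is not reproduced here, but the usual chi-square form of the Farrar-Glauber statistic can be sketched as follows, assuming a complete numeric matrix `X`:

```r
# Farrar-Glauber chi-square test on the determinant of the correlation
# matrix (H0: absence of collinearity); X is an assumed complete numeric matrix
fg_test <- function(X) {
  R    <- cor(X, method = "spearman")
  n    <- nrow(X); p <- ncol(X)
  stat <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df   <- p * (p - 1) / 2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df = df, lower.tail = FALSE))
}
set.seed(1)
X <- matrix(rnorm(300), ncol = 3)
X[, 3] <- X[, 1] + rnorm(100, sd = 0.1) # a near-collinear pair
fg_test(X) # a small p-value flags collinearity
```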
If the initial number of predictors is not too large, this information can be sufficient for the user to visualize potential problems of collinearity and to select a subset of predictors (RF = FALSE
).
It is nevertheless often necessary to complete this visualization with an automatic selection process, such as the random forest approach (see Breiman 2001 for a better understanding of the method) implemented in the function select_pred
(RF = TRUE
).
C. RANDOM FOREST PROCEDURE
As a final step of the process, a random forest approach (RF, (3)) is preferred here (to regression models) for two main reasons: RF methods notably allow the number of variables to exceed the number of rows, and they remain applicable whatever the types of covariates considered.
The function select_pred
integrates in its algorithm the functions cforest
and varimp
of the package party (Hothorn, 2006) and so gives access to their main arguments.
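A hedged sketch of the underlying party calls is given below; the construction of the complete categorical subset of base A is an assumption made for the illustration, and `select_pred` handles these steps internally:

```r
library(party)
data(simu_data)
# assumed complete categorical subset of base A: outcome Yb1 plus the
# categorical covariates (columns 4 to 7), characters converted to factors
datA <- na.omit(simu_data[simu_data$DB == "A", c(2, 4:7)])
datA[] <- lapply(datA, function(v) if (is.character(v)) factor(v) else v)
set.seed(3023)                                          # RF_SEED
cf <- cforest(Yb1 ~ ., data = datA,
              controls = cforest_unbiased(ntree = 500)) # RF_ntree
vi <- varimp(cf, conditional = TRUE, threshold = 0.2)   # RF_condi, RF_condi_thr
sort(vi, decreasing = TRUE)                             # importance ranking
```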
A RF approach generally provides two types of measures for estimating the mean variable importance of each covariate in the prediction of an outcome: the Gini importance and the permutation importance. These measurements must be used with caution, by taking into account the following constraints:
The Gini importance criterion can produce bias in favor of continuous variables and variables with many categories. To avoid this problem, only the permutation criterion is available in the function.
The permutation importance criterion can overestimate the importance of highly correlated predictors.
The function select_pred
proposes three different scenarios according to the types of predictors:
The first one consists in reducing the problem to a set of categorical variables (ordered or not) by discretizing all the continuous predictors beforehand, using the internal `convert_num`
argument or any external approach, and then working with the conditional importance measures (RF_condi = TRUE
), which give unbiased estimations.
In the spirit of a partial correlation, the conditional importance measure related to a given predictor X, for the prediction of the outcome, only uses the subset of variables most correlated with X for its computation. The argument
RF_condi_thr
, which corresponds exactly to the argument threshold
of the function varimp
,
fixes a ratio below which a variable Z is considered sufficiently correlated with X to be used as an adjustment variable in the computation of the importance measure of X
(in other words, Z is included in the conditioning for the computation, see (4) and (5) for more details). A threshold value of zero will include all variables in the computation of the
conditional importance measure of each predictor X
, while a strictly positive threshold will only include a subset of variables.
Two remarks related to this method: firstly, notice that taking into account only subsets of predictors in the computation of the variable importance measures can lead to a substantial saving of execution time.
Secondly, because this approach does not take into account incomplete information, the method will only be applied to complete data (incomplete rows will be temporarily removed for the study).
The second possibility, still in the presence of mixed-type predictors, consists in the execution of two successive RF procedures. The first one is used to select a unique candidate in each subset of correlated predictors (detected in the first step), while the second one extracts the permutation measures from the remaining subset
of uncorrelated predictors (RF_condi = FALSE
, by default). This second possibility has the advantage of working in the presence of incomplete predictors.
The third scenario consists in running the function a first time without the RF process (RF = FALSE
); then, depending on the presence or absence of highly correlated predictors, users can choose to remove redundant predictors manually and re-run the function on the subset of remaining non-collinear predictors, to avoid the potential biases introduced by the standard permutation measures.
The three scenarios finally lead to a list of uncorrelated predictors of the outcome, sorted in order of importance. The argument thresh_Y
corresponds to the minimal percent of importance required (and fixed by the user) for a variable to be considered as a reliable predictor of the outcome.
Finally, because all random forest results are subject to random variation, users can check whether the same importance ranking is achieved by varying the random seed parameter (RF_SEED
) or by increasing the number of trees (RF_ntree
).
A list of 14 objects (if RF = TRUE
) or 11 objects (only the first eleven listed below, if RF = FALSE
) is returned:
seed |
the value used to fix the random number generator for the study |
outc |
the identifier of the outcome to predict |
thresh |
a summary of the different thresholds fixed for the study |
convert_num |
the labels of the continuous predictors transformed into categorical form |
DB_USED |
the final database used after potential transformations of predictors |
vcrm_OUTC_cat |
a table of pairwise associations between the outcome and the categorical predictors (Cramer's V) |
cor_OUTC_num |
a table of pairwise associations between the outcome and the continuous predictors (Rank correlation) |
vcrm_X_cat |
a table of pairwise associations between the categorical predictors (Cramer's V) |
cor_X_num |
a table of pairwise associations between the continuous predictors (rank correlations) |
FG_test |
the results of the Farrar and Glauber tests, with and without approximation form |
collinear_PB |
a table of predictors with problem of collinearity according to the fixed thresholds |
drop_var |
the labels of predictors to drop after RF process (optional output: only if |
RF_PRED |
the table of variable importance measurements, conditional or not, according to the argument |
RF_best |
the labels of the best predictors selected (optional output: Only if |
Gregory Guernec
Bergsma W. (2013). A bias-correction for Cramer's V and Tschuprow's T. Journal of the Korean Statistical Society, 42, 323–328.
Farrar D. and Glauber R. (1968). Multicollinearity in regression analysis. Review of Economics and Statistics, 49, 92–107.
Breiman L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
Hothorn T, Buehlmann P, Dudoit S, Molinaro A, Van Der Laan M (2006). “Survival Ensembles.” Biostatistics, 7(3), 355–373.
Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307
```r
### Example 1
#-----
# - From two overlayed databases: using the table simu_data
# - Searching for the best predictors of "Yb1"
# - Using the raw database
# - The RF approaches are not required
#-----
data(simu_data)
sel_ex1 <- select_pred(simu_data,
  Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Y",
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
  RF = FALSE
)

### Example 2
#-----
# - With same conditions as example 1
# - Searching for the best predictors of "Yb2"
#-----
sel_ex2 <- select_pred(simu_data,
  Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Z",
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
  RF = FALSE
)

### Example 3
#-----
# - With same conditions as example 1
# - Using a RF approach to estimate the standard variable importance measures
#   and determine the best subset of predictors
# - Here a seed is required
#-----
sel_ex3 <- select_pred(simu_data,
  Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Y",
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
  RF = TRUE, RF_condi = FALSE, RF_SEED = 3023
)

### Example 4
#-----
# - With same conditions as example 1
# - Using a RF approach to estimate the conditional variable importance measures
#   and determine the best subset of predictors
# - This approach requires to convert the numeric variables: only "Age" here,
#   discretized in 3 levels
#-----
sel_ex4 <- select_pred(simu_data,
  Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Z",
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  convert_num = 8, convert_class = 3,
  thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
  RF = TRUE, RF_condi = TRUE, RF_condi_thr = 0.60, RF_SEED = 3023
)

### Example 5
#-----
# - Starting with a unique database
# - Same conditions as example 1
#-----
simu_A <- simu_data[simu_data$DB == "A", -3] # base A
sel_ex5 <- select_pred(simu_A,
  Y = "Yb1",
  quanti = 7, nominal = c(1, 3:4, 6), ordinal = c(2, 5),
  thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
  RF = FALSE
)

### Example 6
#-----
# - Starting with a unique database
# - Using a RF approach to estimate the conditional variable importance measures
#   and determine the best subset of predictors
# - This approach requires to convert the numeric variables: only "Age" here,
#   discretized in 3 levels
#-----
simu_B <- simu_data[simu_data$DB == "B", -2] # base B
sel_ex6 <- select_pred(simu_B,
  Y = "Yb2",
  quanti = 7, nominal = c(1, 3:4, 6), ordinal = c(2, 5),
  convert_num = 7, convert_class = 3,
  thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
  RF = TRUE, RF_condi = TRUE, RF_condi_thr = 0.60, RF_SEED = 3023
)
```
The first 300 rows belong to the database A, while the next 400 rows belong to the database B.
Five covariates: Gender
, Treatment
, Dosage
, Smoking
and Age
are
common to both databases (same encodings). Gender
is the only complete covariate.
The variables Yb1
and Yb2
are the target variables of A and B respectively. They summarize the same information saved in two distinct encodings, which is why Yb1
is missing in the database B and Yb2
is missing in the database A.
simu_data
simu_data
A data.frame made of 2 overlayed databases (A and B) with 700 observations on the following 8 variables.
the database identifier, a character with 2 possible classes: A
or B
the target variable of the database A, stored as factor and encoded in 3 ordered levels: [20-40]
, [40-60[
,[60-80]
(the values related to the database B are missing)
the target variable of the database B, stored as integer (an unknown scale from 1 to 5) in the database B (the values related to A are missing)
a factor with 2 levels (Female
or Male
) and no missing values
a covariate of 3 classes stored as a character with 2% of missing values: Placebo
, Trt A
, Trt B
a factor with 4 levels and 5% of missing values: from Dos 1
to dos 4
a covariate of 2 classes stored as a character and 10% of missing values: NO
for non smoker, YES
otherwise
a numeric corresponding to the age of participants in years. This variable counts 5% of missing values
The purpose of the functions contained in this package is to predict the missing information on Yb1
and Yb2
in database A and database B using the Optimal Transportation Theory.
Missing information has been simulated for some covariates following a simple MCAR process.
randomly generated
A dataset of 10000 rows containing 3 covariables and 2 outcomes.
tab_test
tab_test
A data frame with 5000 rows and 6 variables:
identifier, 1 or 2
outcome 1 with 2 levels, observed for ident=1 and unobserved for ident=2
outcome 2 with 4 levels, observed for ident=2 and unobserved for ident=1
covariate 1, integer
covariate 2, integer
covariate 3, integer
randomly generated
This function prepares an overlayed database for data fusion according to the distance function chosen to evaluate the proximities between units.
```r
transfo_dist(
  DB,
  index_DB_Y_Z = 1:3,
  quanti = NULL,
  nominal = NULL,
  ordinal = NULL,
  logic = NULL,
  convert_num = NULL,
  convert_class = NULL,
  prep_choice = "E",
  info = 0.8
)
```
DB |
a data.frame composed of exactly two overlayed databases with a column of database identifier, two columns corresponding to a same information differently encoded in the two databases, and covariates. The order of the variables has no importance. |
index_DB_Y_Z |
a vector of exactly three integers. The first integer must correspond to the column index of the database identifier. The second integer corresponds to the index of the target variable in the first database while the third integer corresponds to the index of column related to the target variable in the second database. |
quanti |
the column indexes of all the quantitative variables (database identifier and target variables included) stored in a vector. |
nominal |
the column indexes of all the nominal (not ordered) variables (DB identification and target variables included) stored in a vector. |
ordinal |
the column indexes of all the ordinal variables (DB identification and target variables included) stored in a vector. |
logic |
the column indexes of all the boolean variables stored in a vector. |
convert_num |
the column indexes of the continuous (quantitative) variables to convert into ordered factors. All indexes declared in this argument must also be declared in the argument `quanti` |
convert_class |
according to the argument `convert_num`, a vector of integers indicating the number of classes desired for each corresponding converted variable |
prep_choice |
a character string corresponding to the distance function chosen between: the euclidean distance ("E", by default), the Manhattan distance ("M"), the Gower distance ("G"), the Hamming (also called binary) distance ("H"), and a distance computed from principal components of a factor analysis of mixed data ("FAMD"). |
info |
a ratio (between 0 and 1, 0.8 is the default value) that corresponds to the minimal part of variability that must be taken into account by the remaining principal components of the FAMD when this approach is required. This ratio will fix the number of components that will be kept with this approach. When the argument is set to 1, all the variability is considered. |
A. EXPECTED STRUCTURE FOR THE INPUT DATABASE
In input of this function, the expected database is the result of an overlay between two databases A and B.
This structure can be guaranteed using the specific outputs of the functions merge_dbs
or select_pred
.
Nevertheless, it is also possible to apply directly the function transfo_dist
on a raw database provided that a specific structure is respected in input.
The overlayed database (A placed on top of B) must count at least four columns (in an unspecified order of appearance in the database):
A column indicating the database identifier (two classes or levels if factor: A and B, 1 and 2, ...)
A column dedicated to the outcome (or target variable) of the first database, denoted Y for example. This variable can be of categorical (nominal or ordinal factor) or continuous type. Nevertheless, in this last case, a warning will appear and the variable will be automatically converted into an ordered factor, as a prerequisite format of the database before using the data fusion algorithms.
A column dedicated to the outcome (or target variable) of the second database, denoted Z for example. As before, this variable can be of categorical (nominal or ordinal factor) or continuous type, and the variable will be automatically converted into an ordered factor, as a prerequisite format of the database before using the data fusion algorithms.
At least one shared variable (same encoding in the two databases). Incomplete information is possible on shared covariates only with more than one shared covariate in the final database.
In this context, the two databases are overlayed and the information related to Y in the second database must be missing, as well as the information related to Z
in the first one.
The column indexes related to the database identifier, Y and Z
must be specified in this order in the argument
index_DB_Y_Z
.
Moreover, all the column indexes (including those related to the identifier and to the target variables Y and Z
) of the overlayed database (DB) must be declared once (and only once), among the arguments
quanti
, nominal
, ordinal
, and logic
.
B. TRANSFORMATIONS OF CONTINUOUS COVARIATES
Because some algorithms dedicated to solving recoding problems like JOINT
and R-JOINT
(see (1) and/or the documentation of OT_joint
) require the exclusive use of categorical covariates, the function transfo_dist
integrates in its syntax
a process dedicated to the categorization of continuous variables. For this, it is necessary to rigorously fill in the arguments convert_num
and convert_class
. The first one specifies the indexes of continuous variables to transform
into ordered factors, while the second one assigns the corresponding desired number of levels.
Only covariates should be transformed (not outcomes), and missing values are not taken into account in the transformations.
Notice that all the indexes informed in the argument convert_num
must also be informed in the argument quanti
.
C. TRANSFORMATIONS ON THE DATABASE ACCORDING TO THE CHOSEN DISTANCE FUNCTION
These necessary transformations are related to the type of each covariate.
It depends on the distance function chosen by user in the prep_choice
argument.
1. For the Euclidean ("E") and Manhattan ("M") distances (see (2) and (3)):
all the remaining continuous variables are standardized.
The related recoding to a boolean variable is 1 for TRUE
and 0 for FALSE
.
The recoding of a nominal variable of k classes corresponds to its related disjunctive table (of (k-1) binary variables).
The ordinal variables are all converted to numeric variables (please take care that the order of the classes of each of these variables is well specified at the beginning).
2. For the Hamming ("H") distance (see (2) and (3)): all the continuous variables must be transformed beforehand into categorical form, using the internal process described in section B or via another external approach. The boolean variables are all converted to ordinal form and then turned into binaries. The recoding of a nominal or ordinal variable of k classes corresponds to its related disjunctive table (i.e. (k-1) binary variables).
3. For the Gower ("G") distance (see (4)): all covariates remain unchanged.
4. Using the principal components from a factor analysis for mixed data (FAMD (5)):
a factor analysis for mixed data is applied on the covariates of the database and a specific number of the related principal components is retained (depending on the minimal part of variability explained by the covariates that the user wishes to keep, by varying the info
option).
The function integrates in its syntax the function FAMD
of the package FactoMineR (6) using default parameters.
After this step, the covariates are replaced by the remaining principal components of the FAMD, and each value corresponds to coordinates linked to each component.
Please notice that this method requires complete covariates in input: in the presence of incomplete covariates, the corresponding rows will be dropped from the study, a warning will appear, and the number of remaining rows will be indicated.
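A minimal sketch of the wrapped FAMD step is shown below; the selection of the covariate columns and the conversion of characters to factors are assumptions made for the illustration:

```r
library(FactoMineR)
data(simu_data)
covars <- na.omit(simu_data[, 4:8]) # covariates only, complete rows required
covars[] <- lapply(covars, function(v) if (is.character(v)) factor(v) else v)
res <- FAMD(covars, ncp = 5, graph = FALSE)
# keep the first components whose cumulative share reaches 100 * info:
res$eig[, "cumulative percentage of variance"]
```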
A data.frame whose covariates have been transformed according to the distance function or approach (for FAMD) chosen. The columns of the data.frame may have been reordered so that the database identifier, Y and Z
correspond to the first three columns respectively.
Moreover the order of rows remains unchanged during the process.
Gregory Guernec
Gares V, Omer J (2020) Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615
Anderberg, M.R. (1973). Cluster analysis for applications, 359 pp., Academic Press, New York, NY, USA.
Borg, I. and Groenen, P. (1997). Modern Multidimensional Scaling. Theory and Applications. Springer.
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 623–637.
Pages J. (2004). Analyse factorielle de donnees mixtes. Revue Statistique Appliquee. LII (4). pp. 93-111.
Lê S, Josse J, Husson, F. (2008). FactoMineR: An R Package for Multivariate Analysis. Journal of Statistical Software. 25(1). pp. 1-18.
```r
### Using the table simu_data:
data(simu_data)

# 1. the Euclidean distance (same output with the Manhattan distance):
eucl1 <- transfo_dist(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7),
  ordinal = c(2, 6), logic = NULL, prep_choice = "E"
)
# Here Yb2 was stored as numeric: it has been automatically converted to factor.
# You can also convert Yb2 to an ordered factor beforehand, for example:
sim_data <- simu_data
sim_data$Yb2 <- as.ordered(sim_data$Yb2)
eucl2 <- transfo_dist(sim_data,
  quanti = 8, nominal = c(1, 4:5, 7),
  ordinal = c(2, 3, 6), logic = NULL, prep_choice = "E"
)

# 2. The Euclidean distance generated on principal components
#    by a factor analysis for mixed data (FAMD):
eucl_famd <- transfo_dist(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7),
  ordinal = c(2, 6), logic = NULL, prep_choice = "FAMD"
)
# Please notice that this method works only with rows that have complete
# information on covariates.

# 3. The Gower distance for mixed data:
gow1 <- transfo_dist(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7),
  ordinal = c(2, 6), logic = NULL, prep_choice = "G"
)

# 4. The Hamming distance:
# Here the quanti option may only contain indexes related to targets.
# Column indexes related to potential binary covariates or covariates with
# a finite number of values must be included in the ordinal option.
# So in simu_data, the discretization of the variable age is required (index = 8),
# using the convert_num and convert_class arguments (for tertiles: 3):
ham1 <- transfo_dist(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7),
  ordinal = c(2, 6), convert_num = 8, convert_class = 3,
  prep_choice = "H"
)

### This function works whatever the order of the columns in your database:
# Suppose that we re-order the columns of simu_data:
simu_data2 <- simu_data[, c(2, 4:7, 3, 8, 1)]
# By changing the corresponding indexes in the index_DB_Y_Z argument,
# we observe the desired output:
eucl3 <- transfo_dist(simu_data2,
  index_DB_Y_Z = c(8, 1, 6),
  quanti = 6:7, nominal = c(2:3, 5, 8),
  ordinal = c(1, 4), logic = NULL, prep_choice = "E"
)
```
A function that transforms a factor of n (> 1) levels into (n-1) binary variables.
```r
transfo_quali(x, labx = NULL)
```
x |
a factor |
labx |
a new label for the generated binary variables (by default, the name of the factor is kept) |
A matrix of (n-1) binary variables
Gregory Guernec
```r
treat <- as.factor(c(rep("A", 10), rep("B", 15), rep("C", 12)))
treat_bin <- transfo_quali(treat, "trt")
```
This function prepares the encoding of the target variable before running an algorithm using optimal transportation theory.
```r
transfo_target(z, levels_order = NULL)
```
z |
a factor variable (ordered or not). A variable of another type will be, by default, converted to a factor. |
levels_order |
a vector corresponding to the values of the levels of z. When the target is ordinal, the levels can be sorted in ascending order. By default, the initial order is kept. |
The function transfo_target
is an intermediate function directly implemented in the functions OT_outcome
and OT_joint
,
two functions dedicated to data fusion (see (1) and (2) for details). Nevertheless, this function can also be used separately to assist users in the conversion
of a target variable (outcome) according to the following rules:
A character variable is converted to a factor if the argument levels_order
is set to NULL. In this case, the levels of the factor are assigned by order of appearance in the database.
A character variable is converted to an ordered factor if the argument levels_order
differs from NULL. In this case, the levels of the factor correspond to those assigned in the argument.
A factor stays unchanged if the argument levels_order
is set to NULL. Otherwise, the factor is converted to an ordered factor and the levels are ordered according to the argument levels_order
.
A numeric variable, discrete or continuous, is converted to a factor if the argument levels_order
is set to NULL, and the related levels are the values assigned in ascending order.
A numeric variable, discrete or continuous, is converted to an ordered factor if the argument levels_order
differs from NULL, and the related levels correspond to those assigned in the argument.
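These rules can be checked directly on a small illustrative vector:

```r
# the conversion rules on an illustrative character vector
x <- c("low", "high", "mid", "high")
transfo_target(x)$NEW # factor, levels assigned by order of appearance
transfo_target(x, levels_order = c("low", "mid", "high"))$NEW # ordered factor
```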
The list returned is:
NEW |
an object of class factor of the same length as z |
LEVELS_NEW |
the levels (ordered or not) retained for z |
Gregory Guernec
Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106
Gares V, Omer J (2020) Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615
```r
y <- rnorm(100, 30, 10)
ynew1 <- transfo_target(y)
newlev <- unique(as.integer(y))
ynew2 <- transfo_target(y, levels_order = newlev)
newlev2 <- newlev[-1]
ynew3 <- transfo_target(y, levels_order = newlev2)

outco <- c(rep("A", 25), rep("B", 50), rep("C", 25))
outco_new1 <- transfo_target(outco, levels_order = c("B", "C", "A"))
outco_new2 <- transfo_target(outco, levels_order = c("E", "C", "A", "F"))
outco_new3 <- transfo_target(outco)

outco2 <- c(rep("A", 25), NA, rep("B", 50), rep("C", 25), NA, NA)
gg <- transfo_target(outco2)
hh <- transfo_target(outco2, levels_order = c("B", "C", "A"))
```
This function proposes post-processing verifications after data fusion by optimal transportation algorithms.
```r
verif_OT(
  ot_out,
  group.class = FALSE,
  ordinal = TRUE,
  stab.prob = FALSE,
  min.neigb = 1
)
```
ot_out |
an otres object from OT_outcome or OT_joint |
group.class |
a boolean indicating if the results related to the proximity between outcomes by grouping levels are requested in output (FALSE by default) |
ordinal |
a boolean that indicates if Y and Z are ordinal (TRUE by default) or not |
stab.prob |
a boolean indicating if the results related to the stability of the algorithm are requested in output (FALSE by default) |
min.neigb |
a value indicating the minimal required number of neighbors to consider in the estimation of stability (1 by default). |
In a context of data fusion, where information from a same target population is summarized via two specific variables Y and Z
(two ordinal or nominal factors with different numbers of levels n_Y
and n_Z
), never jointly observed and respectively stored in two distinct databases A and B,
Optimal Transportation (OT) algorithms (see the models
OUTCOME
, R_OUTCOME
, JOINT
, and R_JOINT
of the reference (2) for more details)
propose methods for the recoding of Y in B and/or Z
in A. Outputs from the functions
OT_outcome
and OT_joint
thus provide the related predictions of Y in B and/or Z
in A,
and from these results, the function
verif_OT
provides a set of tools (optional or not, depending on the choices made by the user in input) to estimate:
the association between Y and Z
after recoding
the similarities between observed and predicted distributions
the stability of the predictions proposed by the algorithm
A. PAIRWISE ASSOCIATION BETWEEN Y AND Z
The first step uses standard criteria (Cramer's V, and Spearman's rank correlation coefficient) to evaluate associations between two ordinal variables, in both databases or in only one database.
When the argument group.class = TRUE
, this information can be completed by that provided by the function error_group
, which is directly integrated in the function verif_OT
.
Assuming that n_Y > n_Z, and that one of the two scales of Y
or Z
is unknown, this function gives additional information about the potential link between the levels of the unknown scale.
The function proceeds to this result in two steps. Firstly,
error_group
groups combinations of modalities of Y to build all the possible variables Y'
verifying
n_Y' = n_Z
.
Secondly, the function studies the fluctuations in the association of Z
with each new variable Y'
by using adapted comparison criteria (see the documentation of
error_group
for more details).
If grouping successive classes of Y leads to an improvement in the initial association between Y
and Z
, then it is possible to conclude in favor of an ordinal coding for Y
(rather than nominal),
but also to emphasize the consistency of the predictions proposed by the fusion algorithm.
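The grouping idea can be illustrated in base R on a toy 4-level factor (the levels and the chosen grouping are illustrative; error_group explores all such groupings):

```r
# collapsing 4 levels into 2, as error_group does for every possible grouping
y4 <- factor(c("G1", "G2", "G3", "G4", "G2", "G3"))
y2 <- y4
levels(y2) <- c("G1-G2", "G1-G2", "G3-G4", "G3-G4") # one candidate grouping
table(y4, y2)
```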
B. SIMILARITIES BETWEEN OBSERVED AND PREDICTED DISTRIBUTIONS
When the predictions of Y in B and/or Z
in A are available in the
ot_out
argument, the similarities between the observed and predicted probabilistic distributions of Y and/or Z
are quantified from the Hellinger distance (see (1)).
This measure varies between 0 and 1: a value of 0 corresponds to a perfect similarity while a value close to 1 (the maximum) indicates a great dissimilarity.
Using this distance, two distributions will be considered as close as soon as the observed measure is less than 0.05.
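For two discrete distributions p and q on the same levels, the Hellinger distance reduces to a one-line computation (toy probabilities used here for illustration):

```r
# Hellinger distance between two discrete distributions (toy values)
hellinger <- function(p, q) sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2))
p <- c(0.20, 0.30, 0.50) # observed distribution
q <- c(0.24, 0.28, 0.48) # predicted distribution
hellinger(p, q) # about 0.03: below 0.05, so the distributions are close
```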
C. STABILITY OF THE PREDICTIONS
These results are based on the decision rule which defines the stability of an algorithm in A (or B) as its average ability to assign the same prediction
of Z (or Y
) to individuals that share a same given profile of covariates
and a same given level of Y
(or Z
respectively).
Assuming that the missing information of Z in base A was predicted from an OT algorithm (the reasoning will be identical with the prediction of Y
in B, see (2) and (3) for more details), the function
verif_OT
uses the conditional probabilities stored in the
object estimatorZA
(see the outputs of the functions OT_outcome
and OT_joint
), which contains the estimates of all the conditional probabilities of Z in A, given a profile of covariates x
and given a level of Y
.
Indeed, each individual (or row) i from A is associated with an estimated conditional probability P(Z = z_i | Y = y_i, X = x_i), and averaging all the corresponding estimates can provide an indicator of the stability of the predictions.
The function OT_joint
provides the individual prediction z_i for each subject i according to the maximum a posteriori rule: z_i = argmax_z P(Z = z | Y = y_i, X = x_i).
The function OT_outcome
directly deduces the individual predictions from the probabilities computed in the second part of the algorithm (see (3)).
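The maximum a posteriori step amounts to taking, for each row, the column of highest estimated probability; a toy sketch:

```r
# MAP assignment from a matrix of estimated conditional probabilities
# (rows = individuals of A, columns = levels of Z); toy values
probs <- matrix(c(0.1, 0.6, 0.3,
                  0.5, 0.2, 0.3),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("z1", "z2", "z3")))
colnames(probs)[max.col(probs, ties.method = "first")] # "z2" "z1"
```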
It is nevertheless common that conditional probabilities are estimated from covariate profiles that are too rare to be considered as reliable estimates of reality.
In this context, the use of trimmed means and standard deviations is suggested, by removing the corresponding probabilities from the final computation.
In this way, the function provides in output a table (the eff.neig
object) that gives the frequency of these critical probabilities, to help the user choose.
According to this table, a minimal number of profiles can be imposed for a conditional probability to be part of the final computation by filling in the min.neigb
argument.
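The effect of min.neigb thus amounts to a trimmed average; an illustrative sketch with stand-in vectors (verif_OT builds the real ones internally from estimatorZA):

```r
# illustrative vectors standing in for what verif_OT extracts internally
prob_hat <- c(0.90, 0.75, 1.00, 0.60) # estimated P(Z = z_i | Y = y_i, X = x_i)
n_prof   <- c(12, 3, 1, 8)            # number of profiles behind each estimate
mean(prob_hat[n_prof >= 5])           # stability estimate with min.neigb = 5
```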
Notice that these results are optional and available only if the argument stab.prob = TRUE
.
When the predictions of Z in A and Y
in B are available, the function
verif_OT
provides in output global results as well as results by database.
The res.stab
table can produce NA with OT_outcome
output in the presence of incomplete shared variables: this problem appears when the prox.dist
argument is set to 0 and can
be simply solved by increasing this value.
A list of 7 objects is returned:
nb.profil |
the number of profiles of covariates |
conf.mat |
the global confusion matrix between Y and Z |
res.prox |
a summary table related to the association measures between Y and Z |
res.grp |
a summary table related to the study of the proximity of Y and Z by grouping levels (only if group.class = TRUE) |
hell |
Hellinger distances between observed and predicted distributions |
eff.neig |
a table which corresponds to a count of conditional probabilities according to the number of neighbors used in their computation (only the first ten values) |
res.stab |
a summary table related to the stability of the algorithm |
Gregory Guernec
Liese F, Miescke K-J. (2008). Statistical Decision Theory: Estimation, Testing, and Selection. Springer
Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106
Gares V, Omer J (2020) Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615
OT_outcome
, OT_joint
, proxim_dist
, error_group
```r
### Example 1
#-----
# - Using the data simu_data
# - Studying the proximity between Y and Z using standard criteria
# - When Y and Z are predicted in B and A respectively
# - Using an outcome model (individual assignment with knn)
#-----
data(simu_data)
outc1 <- OT_outcome(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  dist.choice = "G", percent.knn = 0.90, maxrelax = 0,
  convert.num = 8, convert.class = 3,
  indiv.method = "sequential", which.DB = "BOTH", prox.dist = 0.30
)
verif_outc1 <- verif_OT(outc1)
verif_outc1

### Example 2
#-----
# - Using the data simu_data
# - Studying the proximity between Y and Z using standard criteria and studying
#   associations by grouping levels of Z
# - When only Y is predicted in B
# - Tolerated distance between a subject and a profile: 0.30 * distance max
# - Using an outcome model (individual assignment with knn)
#-----
data(simu_data)
outc2 <- OT_outcome(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  dist.choice = "G", percent.knn = 0.90, maxrelax = 0, prox.dist = 0.3,
  convert.num = 8, convert.class = 3,
  indiv.method = "sequential", which.DB = "B"
)
verif_outc2 <- verif_OT(outc2, group.class = TRUE, ordinal = TRUE)
verif_outc2

### Example 3
#-----
# - Using the data simu_data
# - Studying the proximity between Y and Z using standard criteria and studying
#   associations by grouping levels of Z
# - Studying the stability of the conditional probabilities
# - When Y and Z are predicted in B and A respectively
# - Using an outcome model (individual assignment with knn)
#-----
verif_outc2b <- verif_OT(outc2,
  group.class = TRUE, ordinal = TRUE,
  stab.prob = TRUE, min.neigb = 5
)
verif_outc2b
```