Title: | Distributed Trimmed Scores Regression for Handling Missing Data |
---|---|
Description: | Provides functions for handling missing data using Distributed Trimmed Scores Regression and other imputation methods. It includes facilities for data imputation, evaluation metrics, and clustering analysis. It is designed to work in distributed computing environments to handle large datasets efficiently. The philosophy of the package is described in Guo G. (2024) <doi:10.1080/03610918.2022.2091779>. |
Authors: | Guangbao Guo [aut, cre, cph], Ruiling Niu [aut] |
Maintainer: | Guangbao Guo <[email protected]> |
License: | GPL-3 |
Version: | 0.1.0 |
Built: | 2024-12-09 07:02:45 UTC |
Source: | CRAN |
This function performs distributed EM (DEM) imputation: it divides the dataset into D blocks, applies the EM imputation method within each block, and then combines the results. It also calculates evaluation metrics including RMSE, MMAE, RRE, and the Consistency Proportion Index (CPP) using different hierarchical clustering methods.
DEM(data0, data.sample, data.copy, mr, km, D)
data0 |
The original dataset containing the response variable and features. |
data.sample |
The dataset used for sampling, which may contain missing values. |
data.copy |
A copy of the original dataset, used for comparison or validation. |
mr |
Indices of the rows with missing values that need to be predicted. |
km |
The number of clusters for k-means clustering. |
D |
The number of blocks to divide the data into. |
A list containing:
XDEM |
The imputed dataset. |
RMSEDEM |
The Root Mean Squared Error. |
MAEDEM |
The Mean Absolute Error. |
REDEM |
The Relative Error. |
GCVDEM |
The Generalized Cross-Validation (GCV) value for the DEM imputation. |
timeDEM |
The DEM algorithm execution time. |
See EM for the original EM function.
# Create a sample dataset with missing values
set.seed(123)
n <- 100
p <- 5
data.sample <- matrix(rnorm(n * p), nrow = n)
data.sample[sample(1:(n*p), 20)] <- NA
data.copy <- data.sample
data0 <- data.frame(data.sample, response = rnorm(n))
mr <- sample(1:n, 10) # Sample rows for evaluation
km <- 3 # Number of clusters
D <- 2 # Number of blocks
# Perform DEM imputation
result <- DEM(data0, data.sample, data.copy, mr, km, D)
# Print the results
print(result$XDEM)
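The divide, impute, and combine scheme described for DEM can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper names `impute_block` and `blockwise_impute` are hypothetical, contiguous row blocks are an assumption, and column-mean imputation stands in for the per-block EM step.

```r
# Hypothetical helper: fill NAs in one block with that block's column means.
# Mean imputation is a stand-in for the per-block EM step (assumption).
impute_block <- function(block) {
  mu <- colMeans(block, na.rm = TRUE)
  idx <- which(is.na(block), arr.ind = TRUE)
  block[idx] <- mu[idx[, "col"]]
  block
}

# Hypothetical helper: split rows into D contiguous blocks, impute each
# block on its own, then stack the blocks back in their original order.
blockwise_impute <- function(x, D) {
  block_id <- cut(seq_len(nrow(x)), breaks = D, labels = FALSE)
  blocks <- lapply(split(seq_len(nrow(x)), block_id),
                   function(rows) impute_block(x[rows, , drop = FALSE]))
  do.call(rbind, blocks)
}

set.seed(123)
x <- matrix(rnorm(100 * 5), nrow = 100)
x[cbind(1:20, rep(1:5, 4))] <- NA   # one missing cell in each of rows 1..20
x_imputed <- blockwise_impute(x, D = 2)
```

Because each block is imputed from its own rows only, the blocks could be processed on separate workers; the sketch runs them sequentially with `lapply`.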
This function performs distributed RPCA (DRPCA) imputation: it divides the dataset into D blocks, applies the Robust Principal Component Analysis (RPCA) method to each block, and then combines the results. It also calculates evaluation metrics including RMSE, MMAE, RRE, and Generalized Cross-Validation (GCV) using different hierarchical clustering methods.
DRPCA(data0, data.sample, data.copy, mr, km, D)
data0 |
The original dataset containing the response variable and features. |
data.sample |
The dataset used for sampling, which may contain missing values. |
data.copy |
A copy of the original dataset, used for comparison or validation. |
mr |
Indices of the rows with missing values that need to be predicted. |
km |
The number of clusters for k-means clustering. |
D |
The number of blocks to divide the data into. |
A list containing:
XDRPCA |
The imputed dataset. |
RMSEDRPCA |
The Root Mean Squared Error. |
MAEDRPCA |
The Mean Absolute Error. |
REDRPCA |
The Relative Error. |
GCVDRPCA |
The Generalized Cross-Validation (GCV) value for the DRPCA imputation. |
timeDRPCA |
The DRPCA algorithm execution time. |
See RPCA for the original RPCA function.
# Create a sample dataset with missing values
set.seed(123)
n <- 100
p <- 10
D <- 2 # Number of blocks
data.sample <- matrix(rnorm(n * p), nrow = n)
data.sample[sample(1:(n-10), (p-2))] <- NA
data.copy <- data.sample
data0 <- data.frame(data.sample, response = rnorm(n))
mr <- sample(1:n, 10) # Sample rows for evaluation
km <- 3 # Number of clusters
# Perform DRPCA imputation
result <- DRPCA(data0, data.sample, data.copy, mr, km, D)
# Print the results
print(result$XDRPCA)
This function performs Distributed Trimmed Scores Regression (DTSR) imputation: it divides the dataset into D blocks, applies the Trimmed Scores Regression (TSR) method to each block, and then combines the results. It also calculates evaluation metrics including RMSE, MMAE, RRE, and the Consistency Proportion Index (CPP) using different hierarchical clustering methods.
DTSR(data0, data.sample, data.copy, mr, km, D)
data0 |
The original dataset containing the response variable and features. |
data.sample |
The dataset used for sampling, which may contain missing values. |
data.copy |
A copy of the original dataset, used for comparison or validation. |
mr |
Indices of the rows with missing values that need to be predicted. |
km |
The number of clusters for k-means clustering. |
D |
The number of blocks to divide the data into. |
A list containing:
XDTSR |
The imputed dataset. |
RMSEDTSR |
The Root Mean Squared Error. |
MAEDTSR |
The Mean Absolute Error. |
REDTSR |
The Relative Error. |
GCVDTSR |
The Generalized Cross-Validation (GCV) value for the DTSR imputation. |
timeDTSR |
The DTSR algorithm execution time. |
See TSR for the original TSR function.
# Create a sample matrix with random values and introduce missing values
set.seed(123)
n <- 100
p <- 10
D <- 2 # Number of blocks
data.sample <- matrix(rnorm(n * p), nrow = n)
data.sample[sample(1:(n-10), (p-2))] <- NA
data.copy <- data.sample
data0 <- data.frame(data.sample, response = rnorm(n))
mr <- sample(1:n, 10) # Sample rows for evaluation
km <- 3 # Number of clusters
# Perform DTSR imputation
result <- DTSR(data0, data.sample, data.copy, mr, km, D)
# Print the results
print(result$XDTSR)
This function performs Expectation-Maximization (EM) imputation on a dataset with missing values. It uses the 'imputeEM' function from the 'mvdalab' package to estimate the missing values. The function also calculates various evaluation metrics including RMSE, MMAE, and RRE. Additionally, it performs k-means and hierarchical clustering to assess the quality of the imputation.
EM(data0, data.sample, data.copy, mr, km)
data0 |
The original dataset containing the response variable and features. |
data.sample |
The dataset used for sampling, which may contain missing values. |
data.copy |
A copy of the original dataset, used for comparison or validation. |
mr |
Indices of the rows with missing values that need to be predicted. |
km |
The number of clusters for k-means clustering. |
A list containing:
Xnew |
The imputed dataset. |
RMSE |
The Root Mean Squared Error. |
MMAE |
The Mean Absolute Error. |
RRE |
The Relative Error. |
CPP1 |
The K-means clustering Consistency Proportion Index. |
CPP2 |
The Hierarchical Clustering Complete Linkage Consistency Proportion Index. |
CPP3 |
The Hierarchical Clustering Single Linkage Consistency Proportion Index. |
CPP4 |
The Hierarchical Clustering Average Linkage Consistency Proportion Index. |
CPP5 |
The Hierarchical Clustering Centroid linkage Consistency Proportion Index. |
CPP6 |
The Hierarchical Clustering Median Linkage Consistency Proportion Index. |
CPP7 |
The Hierarchical Clustering Ward's Method Consistency Proportion Index. |
timeEM |
The EM algorithm execution time. |
# Create a sample matrix with random values and introduce missing values
set.seed(123)
n <- 100
p <- 5
data.sample <- matrix(rnorm(n * p), nrow = n)
data.sample[sample(1:(n*p), 20)] <- NA
data.copy <- data.sample
data0 <- data.frame(data.sample, response = rnorm(n))
mr <- sample(1:n, 10) # Sample rows for evaluation
km <- 3 # Number of clusters
# Perform EM imputation
result <- EM(data0, data.sample, data.copy, mr, km)
# Print the results
print(result$RMSE)
print(result$MMAE)
print(result$RRE)
print(result$CPP1)
print(result$Xnew)
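The values CPP1 to CPP7 are described as coming from one k-means run and six hierarchical linkages. As a rough illustration of how such cluster labels might be produced with the stats package, the sketch below clusters a complete matrix seven ways. The mapping of Ward's method to `"ward.D"` (rather than `"ward.D2"`) and the use of Euclidean distances are assumptions, not something the documentation states.

```r
set.seed(123)
x <- matrix(rnorm(100 * 5), nrow = 100)  # stands in for an imputed dataset
km <- 3                                  # number of clusters

# k-means labels (the basis of CPP1 in the value list)
km_labels <- kmeans(x, centers = km)$cluster

# One hierarchical clustering per linkage (the basis of CPP2..CPP7);
# "ward.D" is an assumed choice for Ward's method.
linkages <- c("complete", "single", "average", "centroid", "median", "ward.D")
d <- dist(x)  # Euclidean distances between rows (assumption)
hc_labels <- sapply(linkages, function(m) {
  cutree(hclust(d, method = m), k = km)
})

# hc_labels has one column of cluster labels per linkage
head(hc_labels)
```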
This function calculates the Consistency Proportion Index (CPP), a measure of the consistency of clustering results. The CPP is calculated by determining the most common cluster assignment for each group and then computing the proportion of cases that are assigned to these clusters.
IndexCPP(I)
I |
A matrix where each row represents a case and each column represents a cluster assignment. The last column should indicate the group membership (1, 2, or 3). |
A list containing:
ICPP |
The Consistency Proportion Index. |
# Example usage
set.seed(123)
n <- 100
values1 <- sample(1:3, 30, replace = TRUE)
values2 <- sample(1:3, 30, replace = TRUE) + 1
values3 <- sample(1:3, 40, replace = TRUE) + 2
values <- c(values1, values2, values3)
categories <- c(rep(1, 30), rep(2, 30), rep(3, 40))
I <- cbind(1:n, values, categories)
CPP <- IndexCPP(I)
print(CPP)
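The calculation described for the CPP (most common cluster per group, then the proportion of cases falling in those clusters) can be worked through on a small hand-checkable example. The function `cpp_sketch` below is hypothetical and takes the cluster and group vectors directly; the package's `IndexCPP` may differ in details such as how ties or repeated clusters across groups are handled.

```r
# Hypothetical re-statement of the CPP idea from the description above.
cpp_sketch <- function(cluster, group) {
  matched <- 0
  for (g in unique(group)) {
    counts <- table(cluster[group == g])  # cluster frequencies within group g
    matched <- matched + max(counts)      # cases in the most common cluster
  }
  matched / length(group)
}

cluster <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 1)
group   <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)

# Group 1 has clusters (1, 1, 2): 2 cases match.
# Group 2 has clusters (2, 2, 3): 2 cases match.
# Group 3 has clusters (3, 3, 3, 1): 3 cases match.
cpp_sketch(cluster, group)  # (2 + 2 + 3) / 10 = 0.7
```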
This function performs imputation using the K-Nearest Neighbors (KNN) algorithm and calculates various evaluation metrics including RMSE, MMAE, RRE, and Consistency Proportion Index (CPP) using different hierarchical clustering methods. It also records the execution time of the process.
KNN(data0, data.sample, data.copy, mr, km)
data0 |
The original dataset containing the response variable and features. |
data.sample |
The dataset used for sampling, which may contain missing values. |
data.copy |
A copy of the original dataset, used for comparison or validation. |
mr |
Indices of the rows with missing values that need to be predicted. |
km |
The number of clusters for k-means clustering. |
A list containing:
Xnew |
The imputed dataset. |
RMSE |
The Root Mean Squared Error. |
MMAE |
The Mean Absolute Error. |
RRE |
The Relative Error. |
CPP1 |
The K-means clustering Consistency Proportion Index. |
CPP2 |
The Hierarchical Clustering Complete Linkage Consistency Proportion Index. |
CPP3 |
The Hierarchical Clustering Single Linkage Consistency Proportion Index. |
CPP4 |
The Hierarchical Clustering Average Linkage Consistency Proportion Index. |
CPP5 |
The Hierarchical Clustering Centroid linkage Consistency Proportion Index. |
CPP6 |
The Hierarchical Clustering Median Linkage Consistency Proportion Index. |
CPP7 |
The Hierarchical Clustering Ward's Method Consistency Proportion Index. |
timeKNN |
The KNN algorithm execution time. |
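For orientation, the general idea of K-Nearest Neighbors imputation can be sketched in base R as follows. The function `knn_impute_sketch` is hypothetical, and the distance (root mean squared difference over co-observed columns) and the plain neighbor average are assumptions; the package's `KNN` function may use a different variant.

```r
# Hypothetical sketch of KNN imputation; not the package's KNN().
knn_impute_sketch <- function(x, k = 5) {
  filled <- x
  col_means <- colMeans(x, na.rm = TRUE)
  for (i in which(rowSums(is.na(x)) > 0)) {
    miss <- is.na(x[i, ])
    # Distance from row i to every row, using only columns observed in both
    dists <- apply(x, 1, function(r) {
      shared <- !miss & !is.na(r)
      if (!any(shared)) return(Inf)
      sqrt(sum((x[i, shared] - r[shared])^2) / sum(shared))
    })
    dists[i] <- Inf                       # a row is not its own neighbor
    neighbours <- order(dists)[seq_len(k)]
    for (j in which(miss)) {
      vals <- x[neighbours, j]
      # base:: prefix because this package also defines a function named mean
      filled[i, j] <- if (all(is.na(vals))) col_means[j]
                      else base::mean(vals, na.rm = TRUE)
    }
  }
  filled
}

set.seed(123)
x <- matrix(rnorm(100 * 5), nrow = 100)
x[cbind(1:20, rep(1:5, 4))] <- NA   # one missing cell in each of rows 1..20
x_knn <- knn_impute_sketch(x, k = 5)
```

When none of the k neighbors has the needed column observed, the sketch falls back to the column mean so that no cell is left missing.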
This function performs mean imputation on a dataset with missing values. It replaces missing values with the column means and calculates various evaluation metrics including RMSE, MMAE, and RRE. Additionally, it performs k-means and hierarchical clustering to assess the quality of the imputation.
mean(data0, data.sample, data.copy, mr, km)
data0 |
The original dataset containing the response variable and features. |
data.sample |
The dataset used for sampling, which may contain missing values. |
data.copy |
A copy of the original dataset, used for comparison or validation. |
mr |
Indices of the rows with missing values that need to be predicted. |
km |
The number of clusters for k-means clustering. |
A list containing:
Xnew |
The imputed dataset. |
RMSE |
The Root Mean Squared Error. |
MMAE |
The Mean Absolute Error. |
RRE |
The Relative Error. |
CPP1 |
The K-means clustering Consistency Proportion Index. |
CPP2 |
The Hierarchical Clustering Complete Linkage Consistency Proportion Index. |
CPP3 |
The Hierarchical Clustering Single Linkage Consistency Proportion Index. |
CPP4 |
The Hierarchical Clustering Average Linkage Consistency Proportion Index. |
CPP5 |
The Hierarchical Clustering Centroid linkage Consistency Proportion Index. |
CPP6 |
The Hierarchical Clustering Median Linkage Consistency Proportion Index. |
CPP7 |
The Hierarchical Clustering Ward's Method Consistency Proportion Index. |
timemean |
The execution time of the mean imputation. |
See kmeans in the stats package for more information on k-means clustering.
See hclust in the stats package for more information on hierarchical clustering.
# Create a sample matrix with random values and introduce missing values
set.seed(123)
n <- 100
p <- 5
data.sample <- matrix(rnorm(n * p), nrow = n)
data.sample[sample(1:(n*p), 20)] <- NA
data.copy <- data.sample
data0 <- data.frame(data.sample, response = rnorm(n))
mr <- sample(1:n, 10) # Sample rows for evaluation
km <- 3 # Number of clusters
# Perform mean imputation
result <- mean(data0, data.sample, data.copy, mr, km)
# Print the results
print(result$RMSE)
print(result$MMAE)
print(result$RRE)
print(result$CPP1)
print(result$Xnew)
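To make the column-mean step and the error metrics concrete, here is a small base R sketch that imputes against a known complete matrix and scores the result. The metric formulas are assumptions (the documentation names RMSE, a mean absolute error, and a relative error but does not define them), and all variable names are illustrative.

```r
set.seed(123)
x_true <- matrix(rnorm(100 * 5), nrow = 100)  # complete reference data
x_miss <- x_true
x_miss[sample(length(x_miss), 20)] <- NA      # hide 20 cells

# Replace each missing cell with the mean of its column's observed values
col_means <- colMeans(x_miss, na.rm = TRUE)
idx <- which(is.na(x_miss), arr.ind = TRUE)
x_hat <- x_miss
x_hat[idx] <- col_means[idx[, "col"]]

# Errors on the imputed cells only (assumed metric definitions)
err  <- x_hat[idx] - x_true[idx]
rmse <- sqrt(sum(err^2) / length(err))              # root mean squared error
mae  <- sum(abs(err)) / length(err)                 # mean absolute error
rel  <- sqrt(sum(err^2)) / sqrt(sum(x_true[idx]^2)) # relative error

c(RMSE = rmse, MAE = mae, RE = rel)
```

The sketch avoids calling `mean()` directly because this package defines its own function of that name, which masks the base version once the package is attached.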
This function performs Multilinear Principal Component Analysis (MLPCA) to handle missing data by imputing the missing values based on the correlation structure within the data. It also calculates the RMSE and Consistency Proportion Index (CPP) using different hierarchical clustering methods.
MLPCA(data0, data.sample, data.copy, mr, km)
data0 |
The original dataset containing the response variable and features. |
data.sample |
The dataset used for sampling, which may contain missing values. |
data.copy |
A copy of the original dataset, used for comparison or validation. |
mr |
Indices of the rows with missing values that need to be predicted. |
km |
The number of clusters for k-means clustering. |
A list containing:
Xnew |
The imputed dataset. |
RMSE |
The Root Mean Squared Error. |
CPP1 |
The K-means clustering Consistency Proportion Index. |
CPP2 |
The Hierarchical Clustering Complete Linkage Consistency Proportion Index. |
CPP3 |
The Hierarchical Clustering Single Linkage Consistency Proportion Index. |
CPP4 |
The Hierarchical Clustering Average Linkage Consistency Proportion Index. |
CPP5 |
The Hierarchical Clustering Centroid linkage Consistency Proportion Index. |
CPP6 |
The Hierarchical Clustering Median Linkage Consistency Proportion Index. |
CPP7 |
The Hierarchical Clustering Ward's Method Consistency Proportion Index. |
timeKNN |
The MLPCA algorithm execution time. |
See princomp and svd for more information on PCA and SVD.
# Create a sample matrix with random values and introduce missing values
set.seed(123)
n <- 100
p <- 5
data.sample <- matrix(rnorm(n * p), nrow = n)
data.sample[sample(1:(n*p), 20)] <- NA
data.copy <- data.sample
data0 <- data.frame(data.sample, response = rnorm(n))
mr <- sample(1:n, 10) # Sample rows for evaluation
km <- 3 # Number of clusters
# Perform MLPCA imputation
result <- MLPCA(data0, data.sample, data.copy, mr, km)
# Print the results
print(result$RMSE)
print(result$CPP1)
print(result$Xnew)
This function performs the NIPALS (Nonlinear Iterative Partial Least Squares) algorithm to handle missing data by imputing the missing values based on the correlation structure within the data. It also calculates the RMSE and Consistency Proportion Index (CPP) using different hierarchical clustering methods.
NIPALS(data0, data.sample, data.copy, mr, km)
data0 |
The original dataset containing the response variable and features. |
data.sample |
The dataset used for sampling, which may contain missing values. |
data.copy |
A copy of the original dataset, used for comparison or validation. |
mr |
Indices of the rows with missing values that need to be predicted. |
km |
The number of clusters for k-means clustering. |
A list containing:
Xnew |
The imputed dataset. |
RMSE |
The Root Mean Squared Error. |
CPP1 |
The K-means clustering Consistency Proportion Index. |
CPP2 |
The Hierarchical Clustering Complete Linkage Consistency Proportion Index. |
CPP3 |
The Hierarchical Clustering Single Linkage Consistency Proportion Index. |
CPP4 |
The Hierarchical Clustering Average Linkage Consistency Proportion Index. |
CPP5 |
The Hierarchical Clustering Centroid linkage Consistency Proportion Index. |
CPP6 |
The Hierarchical Clustering Median Linkage Consistency Proportion Index. |
CPP7 |
The Hierarchical Clustering Ward's Method Consistency Proportion Index. |
timeNIPALS |
The NIPALS algorithm execution time. |
See princomp and svd for more information on PCA and SVD.
# Create a sample matrix with random values and introduce missing values
set.seed(123)
n <- 100
p <- 5
data.sample <- matrix(rnorm(n * p), nrow = n)
data.sample[sample(1:(n*p), 20)] <- NA
data.copy <- data.sample
data0 <- data.frame(data.sample, response = rnorm(n))
mr <- sample(1:n, 10) # Sample rows for evaluation
km <- 3 # Number of clusters
# Perform NIPALS imputation
result <- NIPALS(data0, data.sample, data.copy, mr, km)
# Print the results
print(result$RMSE)
print(result$CPP1)
print(result$Xnew)
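As background on how a NIPALS-style iteration can work around missing cells, the following base R sketch extracts a single component using only the observed entries and uses the rank-1 reconstruction to fill the gaps. The helper names are hypothetical, the single-component choice is an assumption made for brevity, and the sketch assumes no row or column is entirely missing; it is not the package's `NIPALS` function.

```r
# Hypothetical: first NIPALS component, skipping missing cells in every sum.
nipals_pc1 <- function(x, max_iter = 500, tol = 1e-9) {
  w <- (!is.na(x)) * 1           # 1 where observed, 0 where missing
  x0 <- x
  x0[is.na(x)] <- 0              # zeros drop missing cells out of the sums
  t_score <- x0[, which.max(colSums(x0^2))]   # start from the largest column
  for (iter in seq_len(max_iter)) {
    # Loadings: regress each column on the scores over its observed rows
    p <- drop(crossprod(x0, t_score)) / drop(crossprod(w, t_score^2))
    p <- p / sqrt(sum(p^2))
    # Scores: regress each row on the loadings over its observed columns
    t_new <- drop(x0 %*% p) / drop(w %*% p^2)
    done <- sum((t_new - t_score)^2) < tol * sum(t_new^2)
    t_score <- t_new
    if (done) break
  }
  list(scores = t_score, loadings = p)
}

# Hypothetical: fill missing cells from the centered rank-1 reconstruction.
nipals_impute_sketch <- function(x) {
  mu <- colMeans(x, na.rm = TRUE)
  fit <- nipals_pc1(sweep(x, 2, mu))
  recon <- sweep(tcrossprod(fit$scores, fit$loadings), 2, mu, "+")
  x[is.na(x)] <- recon[is.na(x)]
  x
}

set.seed(123)
x <- matrix(rnorm(100 * 5), nrow = 100)
x[cbind(1:20, rep(1:5, 4))] <- NA   # one missing cell in each of rows 1..20
x_nipals <- nipals_impute_sketch(x)
```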
This function performs Robust Principal Component Analysis (RPCA) to handle missing data by imputing the missing values based on the correlation structure within the data. It also calculates various evaluation metrics including RMSE, MMAE, RRE, and Consistency Proportion Index (CPP) using different hierarchical clustering methods.
RPCA(data0, data.sample, data.copy, mr, km)
data0 |
The original dataset containing the response variable and features. |
data.sample |
The dataset used for sampling, which may contain missing values. |
data.copy |
A copy of the original dataset, used for comparison or validation. |
mr |
Indices of the rows with missing values that need to be predicted. |
km |
The number of clusters for k-means clustering. |
A list containing:
Xnew |
The imputed dataset. |
RMSE |
The Root Mean Squared Error. |
MMAE |
The Mean Absolute Error. |
RRE |
The Relative Error. |
CPP1 |
The K-means clustering Consistency Proportion Index. |
CPP2 |
The Hierarchical Clustering Complete Linkage Consistency Proportion Index. |
CPP3 |
The Hierarchical Clustering Single Linkage Consistency Proportion Index. |
CPP4 |
The Hierarchical Clustering Average Linkage Consistency Proportion Index. |
CPP5 |
The Hierarchical Clustering Centroid linkage Consistency Proportion Index. |
CPP6 |
The Hierarchical Clustering Median Linkage Consistency Proportion Index. |
CPP7 |
The Hierarchical Clustering Ward's Method Consistency Proportion Index. |
timeRPCA |
The RPCA algorithm execution time. |
See princomp and svd for more information on PCA and SVD.
# Create a sample matrix with random values and introduce missing values
set.seed(123)
n <- 100
p <- 5
data.sample <- matrix(rnorm(n * p), nrow = n)
data.sample[sample(1:(n*p), 20)] <- NA
data.copy <- data.sample
data0 <- data.frame(data.sample, response = rnorm(n))
mr <- sample(1:n, 10) # Sample rows for evaluation
km <- 3 # Number of clusters
# Perform RPCA imputation
result <- RPCA(data0, data.sample, data.copy, mr, km)
# Print the results
print(result$RMSE)
print(result$MMAE)
print(result$RRE)
print(result$CPP1)
print(result$Xnew)
This function performs imputation using Singular Value Decomposition (SVD) and calculates various evaluation metrics including RMSE, MMAE, RRE, and Consistency Proportion Index (CPP) using different hierarchical clustering methods.
SVD(data0, data.sample, data.copy, mr, km)
data0 |
The original dataset containing the response variable and features. |
data.sample |
The dataset used for sampling, which may contain missing values. |
data.copy |
A copy of the original dataset, used for comparison or validation. |
mr |
Indices of the rows with missing values that need to be predicted. |
km |
The number of clusters for k-means clustering. |
A list containing:
Xnew |
The imputed dataset. |
RMSE |
The Root Mean Squared Error. |
MMAE |
The Mean Absolute Error. |
RRE |
The Relative Error. |
CPP1 |
The K-means clustering Consistency Proportion Index. |
CPP2 |
The Hierarchical Clustering Complete Linkage Consistency Proportion Index. |
CPP3 |
The Hierarchical Clustering Single Linkage Consistency Proportion Index. |
CPP4 |
The Hierarchical Clustering Average Linkage Consistency Proportion Index. |
CPP5 |
The Hierarchical Clustering Centroid linkage Consistency Proportion Index. |
CPP6 |
The Hierarchical Clustering Median Linkage Consistency Proportion Index. |
CPP7 |
The Hierarchical Clustering Ward's Method Consistency Proportion Index. |
timeSVD |
The SVD algorithm execution time. |
See princomp and svd for more information on PCA and SVD.
This function performs imputation using Singular Value Decomposition (SVD) with iterative refinement. It begins by filling missing values with the mean of their respective columns. Then, it computes a low-rank (k) approximation of the data matrix. Using this approximation, it refills the missing values. This process of recomputing the rank-k approximation with the newly imputed values and refilling the missing data is repeated for a specified number of iterations, 'num.iters'.
SVDImpute(x, k, num.iters = 10, verbose = TRUE)
x |
A data frame or matrix where each row represents a different record. |
k |
The rank-k approximation to use for the data matrix. |
num.iters |
The number of times to compute the rank-k approximation and impute the missing data. |
verbose |
If TRUE, print status updates during the process. |
A list containing:
data.matrix |
The imputed matrix with missing values filled. |
# Create a sample matrix with random values and introduce missing values
x <- matrix(rnorm(100), 10, 10)
x[x > 1] <- NA
# Perform SVD imputation
imputed_x <- SVDImpute(x, 3)
# Print the imputed matrix
print(imputed_x)
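The iterative procedure described for SVDImpute (mean fill, rank-k approximation, refill, repeat) can be written out in a few lines of base R. The function `svd_impute_sketch` is a hypothetical re-statement of those steps for illustration; it omits the `verbose` option and any convergence checks the package function may perform.

```r
# Hypothetical sketch of the steps described above; not SVDImpute() itself.
svd_impute_sketch <- function(x, k, num_iters = 10) {
  x <- as.matrix(x)
  miss <- is.na(x)
  # Step 1: start by filling missing cells with their column means
  col_means <- colMeans(x, na.rm = TRUE)
  x[miss] <- col_means[col(x)[miss]]
  for (iter in seq_len(num_iters)) {
    # Step 2: rank-k approximation of the current completed matrix
    s <- svd(x, nu = k, nv = k)
    approx <- s$u %*% diag(s$d[seq_len(k)], nrow = k) %*% t(s$v)
    # Step 3: refill only the originally missing cells
    x[miss] <- approx[miss]
  }
  x
}

set.seed(123)
x <- matrix(rnorm(100), 10, 10)
x[cbind(1:10, 1:10)] <- NA     # hide the diagonal
x_svd <- svd_impute_sketch(x, k = 3)
```

Observed cells are never overwritten: only positions flagged in `miss` are updated on each pass.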
This function performs Trimmed Scores Regression (TSR) to handle missing data by imputing the missing values based on the correlation structure within the data. It also calculates various evaluation metrics including RMSE, MMAE, RRE, and Consistency Proportion Index (CPP) using different hierarchical clustering methods.
TSR(data0, data.sample, data.copy, mr, km)
data0 |
The original dataset containing the response variable and features. |
data.sample |
The dataset used for sampling, which may contain missing values. |
data.copy |
A copy of the original dataset, used for comparison or validation. |
mr |
Indices of the rows with missing values that need to be predicted. |
km |
The number of clusters for k-means clustering. |
A list containing:
Xnew |
The imputed dataset. |
RMSE |
The Root Mean Squared Error. |
MMAE |
The Mean Absolute Error. |
RRE |
The Relative Error. |
CPP1 |
The K-means clustering Consistency Proportion Index. |
CPP2 |
The Hierarchical Clustering Complete Linkage Consistency Proportion Index. |
CPP3 |
The Hierarchical Clustering Single Linkage Consistency Proportion Index. |
CPP4 |
The Hierarchical Clustering Average Linkage Consistency Proportion Index. |
CPP5 |
The Hierarchical Clustering Centroid linkage Consistency Proportion Index. |
CPP6 |
The Hierarchical Clustering Median Linkage Consistency Proportion Index. |
CPP7 |
The Hierarchical Clustering Ward's Method Consistency Proportion Index. |
timeTSR |
The TSR algorithm execution time. |
See princomp and svd for more information on PCA and SVD.
# Create a sample matrix with random values and introduce missing values
set.seed(123)
n <- 100
p <- 5
data.sample <- matrix(rnorm(n * p), nrow = n)
data.sample[sample(1:(n*p), 20)] <- NA
data.copy <- data.sample
data0 <- data.frame(data.sample, response = rnorm(n))
mr <- sample(1:n, 10) # Sample rows for evaluation
km <- 3 # Number of clusters
# Perform TSR imputation
result <- TSR(data0, data.sample, data.copy, mr, km)
# Print the results
print(result$RMSE)
print(result$MMAE)
print(result$RRE)
print(result$CPP1)
print(result$Xnew)
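For readers unfamiliar with Trimmed Scores Regression, the sketch below shows one common formulation of the idea in base R: the missing part of each row is predicted by regressing on the "trimmed" scores, that is, the scores computed from the observed variables only. The function `tsr_impute_sketch`, the number of components, and the stopping rule are all assumptions; the package's `TSR` function may be implemented differently. The sketch also assumes every row has at least `ncomp` observed values.

```r
# Hypothetical sketch of a TSR-style iteration; not the package's TSR().
tsr_impute_sketch <- function(x, ncomp = 2, max_iter = 100, tol = 1e-8) {
  x <- as.matrix(x)
  miss <- is.na(x)
  col_means <- colMeans(x, na.rm = TRUE)
  x[miss] <- col_means[col(x)[miss]]          # initial column-mean fill
  for (iter in seq_len(max_iter)) {
    mu <- colMeans(x)
    xc <- sweep(x, 2, mu)                     # centered completed data
    S <- crossprod(xc) / (nrow(xc) - 1)       # covariance matrix
    P <- svd(xc, nu = 0, nv = ncomp)$v        # PCA loadings
    x_old <- x
    for (i in which(rowSums(miss) > 0)) {
      m <- miss[i, ]
      o <- !m
      L <- P[o, , drop = FALSE]               # loadings of observed variables
      tau <- crossprod(L, xc[i, o])           # trimmed scores for row i
      # Regress the missing variables on the trimmed scores
      coef <- solve(crossprod(L, S[o, o, drop = FALSE] %*% L), tau)
      x[i, m] <- mu[m] + drop(S[m, o, drop = FALSE] %*% L %*% coef)
    }
    if (sum((x - x_old)^2) < tol * sum(x_old^2)) break
  }
  x
}

set.seed(123)
x <- matrix(rnorm(100 * 5), nrow = 100)
x[cbind(1:20, rep(1:5, 4))] <- NA   # one missing cell in each of rows 1..20
x_tsr <- tsr_impute_sketch(x, ncomp = 2)
```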