Title: | Clustering and Classification Inference with U-Statistics |
---|---|
Description: | Clustering and classification inference for high dimension low sample size (HDLSS) data with U-statistics. The package contains implementations of nonparametric statistical tests for sample homogeneity, group separation, clustering, and classification of multivariate data. The methods have high statistical power and are tailored for data in which the dimension L is much larger than sample size n. See Gabriela B. Cybis, Marcio Valk and Sílvia RC Lopes (2018) <doi:10.1080/00949655.2017.1374387>, Marcio Valk and Gabriela B. Cybis (2020) <doi:10.1080/10618600.2020.1796398>, Debora Z. Bello, Marcio Valk and Gabriela B. Cybis (2021) <arXiv:2106.09115>. |
Authors: | Gabriela Cybis [aut, cre], Marcio Valk [aut], Kazuki Yokoyama [ctb], Debora Zava Bello [ctb] |
Maintainer: | Gabriela Cybis <[email protected]> |
License: | GPL-3 |
Version: | 1.0.0 |
Built: | 2024-12-16 06:55:12 UTC |
Source: | CRAN |
Returns the value for the Bn statistic that measures the degree of separation between two groups. The statistic is computed through the difference of average within group distances to average between group distances. Large values of Bn indicate large group separation. Under overall sample homogeneity we have E(Bn)=0.
bn(group_id, md = NULL, data = NULL)
group_id |
A vector of 0s and 1s indicating to which group the samples belong. Must be in the same order as data or md. |
md |
Matrix of distances between all data points. |
data |
Data matrix. Each row represents an observation. |
Either data OR md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance, which is compatible with is_homo, uclust and uhclust.
For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." Journal of Computational and Graphical Statistics 30(1) (2021).
Value of the Bn statistic.
n = 5
x = matrix(rnorm(n*10), ncol = 10)
bn(c(1,0,0,0,0), data = x)  # option (a) entering the data matrix directly
md = as.matrix(dist(x))^2
bn(c(0,1,1,1,1), md)        # option (b) entering the distance matrix
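The description of bn above (average between-group distances compared with average within-group distances) can be illustrated with a small base R sketch. The helper `bn_sketch` is hypothetical and is not the package's implementation: it assumes the form Bn = n1 n2 / (n (n - 1)) * (2 U12 - U1 - U2), where U1 and U2 are the mean within-group distances and U12 is the mean between-group distance, and it only handles groups with at least two elements. The package's own bn may differ, in particular for groups of size 1.

```r
# Hypothetical helper, NOT the uclust implementation.
# Assumed form: Bn = n1*n2/(n*(n-1)) * (2*U12 - U1 - U2).
bn_sketch <- function(group_id, md) {
  g1 <- which(group_id == 0)
  g2 <- which(group_id == 1)
  n1 <- length(g1)
  n2 <- length(g2)
  n <- n1 + n2
  stopifnot(n1 >= 2, n2 >= 2)  # this sketch does not cover groups of size 1

  # Mean distance over all distinct pairs inside one group
  within_mean <- function(idx) {
    m <- md[idx, idx]
    mean(m[lower.tri(m)])
  }

  u1 <- within_mean(g1)
  u2 <- within_mean(g2)
  u12 <- mean(md[g1, g2])  # mean distance over all between-group pairs

  (n1 * n2 / (n * (n - 1))) * (2 * u12 - u1 - u2)
}

set.seed(42)
# Three observations with mean 0 and three with mean 2, 50 dimensions each
x <- rbind(matrix(rnorm(3 * 50), nrow = 3),
           matrix(rnorm(3 * 50, mean = 2), nrow = 3))
md <- as.matrix(dist(x))^2  # squared Euclidean distances

b_true  <- bn_sketch(c(0, 0, 0, 1, 1, 1), md)  # labels match the simulated groups
b_mixed <- bn_sketch(c(0, 1, 0, 1, 0, 1), md)  # labels cut across the simulated groups
```

Under these assumptions, the labelling that matches the simulated groups should give a clearly larger value than the mixed labelling, which is the behaviour the description attributes to Bn.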
Returns the value for the Bn statistic that measures the degree of separation between three groups. The statistic is computed as a combination of differences of average within group and between group distances. Large values of Bn indicate large group separation. Under overall sample homogeneity we have E(Bn)=0.
bn3(group_id, md = NULL, data = NULL)
group_id |
A vector of 1s, 2s and 3s indicating to which group the samples belong. Must be in the same order as data or md. |
md |
Matrix of distances between all data points. |
data |
Data matrix. Each row represents an observation. |
Either data OR md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
For more detail see Bello, Debora Zava, Marcio Valk and Gabriela Bettella Cybis. "Clustering inference in multiple groups." arXiv preprint arXiv:2106.09115 (2021).
Value of the Bn3 statistic.
n = 7
set.seed(1234)
x = matrix(rnorm(n*10), ncol = 10)
bn3(c(1,2,2,2,3,3,3), data = x)  # option (a) entering the data matrix directly
md = as.matrix(dist(x))^2
bn3(c(1,2,2,2,3,3,3), md)        # option (b) entering the distance matrix
Homogeneity test based on the statistic bn. The test assesses whether there exists a data partition for which group separation is statistically significant according to the U-test. The null hypothesis is overall sample homogeneity, and a sample is considered homogeneous if it cannot be divided into two statistically significant subgroups.
is_homo(md = NULL, data = NULL, rep = 10)
md |
Matrix of distances between all data points. |
data |
Data matrix. Each row represents an observation. |
rep |
Number of times to repeat optimization procedure. Important for problems with multiple optima. |
This is the homogeneity test of Cybis et al. (2017) extended to account for groups of size 1. The test is performed through two steps: an optimization procedure that finds the data partition that maximizes the standardized Bn and a test for the resulting maximal partition. Should be used in high dimension small sample size settings.
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
Variance of bn is estimated through resampling, so p-values may vary slightly between runs.
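For readers who prefer to pass md rather than data, the following base R sketch shows one way to build a squared Euclidean distance matrix and to check an entry by hand. It uses only `dist` and `as.matrix` from base R; the variable names are illustrative.

```r
set.seed(1)
x <- matrix(rnorm(5 * 10), ncol = 10)  # 5 observations, 10 dimensions

# dist() returns Euclidean distances between rows; squaring gives squared distances
md <- as.matrix(dist(x))^2

# Manual check for observations 1 and 2: sum of squared coordinate differences
d12_manual <- sum((x[1, ] - x[2, ])^2)
matches_manual <- isTRUE(all.equal(md[1, 2], d12_manual))

# A valid distance matrix here is square, symmetric, and has a zero diagonal
is_square <- nrow(md) == nrow(x) && ncol(md) == nrow(x)
```

The resulting md has one row and one column per observation, in the same order as the rows of the data matrix.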
For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." Journal of Computational and Graphical Statistics 30(1) (2021).
Returns a list with the following elements:
Test statistic. Minimum of the objective function for optimization (-stdBn).
Elements in group 1 in the maximal partition. (Note: this is not the best partition for the data, see uclust.)
Elements in group 2 in the maximal partition.
P-value for the homogeneity test.
Values for the minimum objective function on all rep optimization runs.
Resampling variance estimate for partitions with groups of size n/2 (or (n-1)/2 and (n+1)/2 if n is odd).
Resampling variance estimate for partitions with one group of size 1.
x = matrix(rnorm(500000), nrow = 50)  # creating homogeneous Gaussian dataset
res = is_homo(data = x)

x[1:30,] = x[1:30,] + 0.15  # Heterogeneous dataset (first 30 samples have different mean)
res = is_homo(data = x)

md = as.matrix(dist(x)^2)  # squared Euclidean distances for the same data
res = is_homo(md)

# Multidimensional scaling plot of distance matrix
fit <- cmdscale(md, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x, y, main = paste("Homogeneity test: p-value =", res$p.MaxTest))
Homogeneity test based on the statistic bn3. The test assesses whether there exists a data partition for which three group separation is statistically significant according to utest3. The null hypothesis is overall sample homogeneity, and a sample is considered homogeneous if it cannot be divided into three groups with at least one significantly different from the others.
is_homo3(md = NULL, data = NULL, rep = 20, test_max = TRUE, alpha = 0.05)
md |
Matrix of distances between all data points. |
data |
Data matrix. Each row represents an observation. |
rep |
Number of times to repeat optimization procedure. Important for problems with multiple optima. |
test_max |
Logical indicating whether to employ the max test |
alpha |
Significance level |
This is the homogeneity test of Bello et al. (2021). The test is performed through two steps: an optimization procedure that finds the data partition that maximizes the standardized Bn and a test for the resulting maximal partition. Should be used in high dimension small sample size settings.
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
Variance of bn is estimated through resampling, so p-values may vary slightly between runs.
For more detail see Bello, Debora Zava, Marcio Valk and Gabriela Bettella Cybis. "Clustering inference in multiple groups." arXiv preprint arXiv:2106.09115 (2021).
Returns a list with the following elements:
Test statistic. Maximum standardized Bn.
Elements in group 1 in the maximal partition. (Note: this is not the best partition for the data, see uclust3.)
Elements in group 2 in the maximal partition.
Elements in group 3 in the maximal partition.
P-value for the homogeneity test.
Alpha after Bonferroni correction
Resampling variance estimate for partitions with central group sizes.
Resampling variance estimate for partitions with one group of size 1.
Estimated variance of Bn for maximal standardized Bn configuration.
set.seed(123)
x = matrix(rnorm(70000), nrow = 7)  # creating homogeneous Gaussian dataset
res = is_homo3(data = x)
res

# uncomment to run
# x = matrix(rnorm(18000), nrow = 18)
# x[1:5,] = x[1:5,] + 0.5  # Heterogeneous dataset (first 5 samples have different mean)
# x[6:9,] = x[6:9,] + 1.5
# res = is_homo3(data = x)
# res
# md = as.matrix(dist(x)^2)  # squared Euclidean distances for the same data
# res = is_homo3(md)

# uncomment to run
# Multidimensional scaling plot of distance matrix
# fit <- cmdscale(md, eig = TRUE, k = 2)
# x <- fit$points[, 1]
# y <- fit$points[, 2]
# plot(x, y, main = paste("Homogeneity test: p-value =", res$p.MaxTest))
Plots the p-value annotated dendrogram resulting from uhclust.
plot_uhclust(uhclust, pvalues_cex = 0.8, pvalues_dx = 2, pvalues_dy = 0.08, print_pvalues = TRUE)
uhclust |
Result from uhclust. |
pvalues_cex |
Graphical parameter for p-value font size. |
pvalues_dx |
Graphical parameter for p-value position shift on x axis. |
pvalues_dy |
Graphical parameter for p-value position shift on y axis. |
print_pvalues |
Logical. Should the p-values be printed? |
x = matrix(rnorm(100000), nrow = 50)
x[1:35,] = x[1:35,] + 0.7
x[1:15,] = x[1:15,] + 0.4
res = uhclust(data = x, plot = FALSE)
plot_uhclust(res)
Simple print method for utest_classify objects.
## S3 method for class 'utest_classify'
print(x, ...)
x |
utest_classify object |
... |
additional parameters passed to the function |
Finds the configuration with max Bn among all configurations.
rep_optimBn(mdm, rep = 15, bootB = -1)
mdm |
Matrix of squared Euclidean distances between all data points. |
rep |
Number of replications |
bootB |
Result of previous bootstrap (if available). If -1, a new bootstrap is performed for the variance of Bn. |
Partitions the sample into the two significant subgroups with the largest Bn statistic. If no significant partition exists, the test will return "homogeneous".
uclust(md = NULL, data = NULL, alpha = 0.05, rep = 15)
md |
Matrix of distances between all data points. |
data |
Data matrix. Each row represents an observation. |
alpha |
Significance level. |
rep |
Number of times to repeat optimization procedures. Important for problems with multiple optima. |
This is the significance clustering procedure of Valk and Cybis (2018). The method first performs a homogeneity test to verify whether the data can be significantly partitioned. If the hypothesis of homogeneity is rejected, the method then searches, among all the significant partitions, for the partition that best separates the data, as measured by the largest bn statistic. This function should be used in high dimension small sample size settings.
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
Variance of bn is estimated through resampling, so p-values may vary slightly between runs.
For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." Journal of Computational and Graphical Statistics 30(1) (2021).
See also is_homo, uhclust, utest_classify.
Returns a list with the following elements:
Elements in group 1 in the final partition. This is the significant partition with maximal Bn, if sample is heterogeneous.
Elements in group 2 in the final partition.
P-value for the test that renders the final partition, if heterogeneous. Homogeneity test p-value, if homogeneous.
Bonferroni corrected significance level for the test that renders the final partition, if heterogeneous. Homogeneity test significance level, if homogeneous.
Size of the smallest cluster
Logical, returns TRUE when the sample is homogeneous.
Value of Bn statistic for the final partition, if heterogeneous. Value of Bn statistic for the maximal homogeneity test partition, if homogeneous.
Variance estimate for final partition, if heterogeneous. Variance estimate for the maximal homogeneity test partition, if homogeneous.
Result of homogeneity test (see is_homo).
set.seed(17161)
x = matrix(rnorm(100000), nrow = 50)  # creating homogeneous Gaussian dataset
res = uclust(data = x)

x[1:30,] = x[1:30,] + 0.25  # Heterogeneous dataset (first 30 samples have different mean)
res = uclust(data = x)

md = as.matrix(dist(x)^2)  # squared Euclidean distances for the same data
res = uclust(md)

# Multidimensional scaling plot of distance matrix
fit <- cmdscale(md, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
col = rep(3, dim(md)[1])
col[res$cluster2] = 2
plot(x, y, main = paste("Multidimensional scaling plot of data: homogeneity p-value =", res$ishomoResult$p.MaxTest), col = col)
Partitions data into three groups only when these partitions are statistically significant. If no significant partition exists, the test will return "homogeneous".
uclust3(md = NULL, data = NULL, alpha = 0.05, rep = 15)
md |
Matrix of distances between all data points. |
data |
Data matrix. Each row represents an observation. |
alpha |
Significance level. |
rep |
Number of times to repeat optimization procedures. Important for problems with multiple optima. |
This is the significance clustering procedure of Bello et al. (2021). The method first performs a homogeneity test to verify whether the data can be significantly partitioned. If the hypothesis of homogeneity is rejected, the method then searches, among all the significant partitions, for the partition that best separates the data, as measured by the largest bn statistic. This function should be used in high dimension small sample size settings.
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
Variance of bn is estimated through resampling, so p-values may vary slightly between runs.
For more detail see Bello, Debora Zava, Marcio Valk and Gabriela Bettella Cybis. "Clustering inference in multiple groups." arXiv preprint arXiv:2106.09115 (2021).
See also is_homo3, uclust.
Returns a list with the following elements:
List with elements of final three groups
P-value for the test that renders the final partition, if heterogeneous. Homogeneity test p-value, if homogeneous.
Bonferroni corrected significance level for the test that renders the final partition, if heterogeneous. Homogeneity test significance level, if homogeneous.
Logical, returns TRUE when the sample is homogeneous.
Value of Bn statistic for the final partition, if heterogeneous. Value of Bn statistic for the maximal homogeneity test partition, if homogeneous.
Variance estimate for final partition, if heterogeneous. Variance estimate for the maximal homogeneity test partition, if homogeneous.
set.seed(123)
x = matrix(rnorm(70000), nrow = 7)  # creating homogeneous Gaussian dataset
res = uclust3(data = x)
res

# uncomment to run
# x = matrix(rnorm(15000), nrow = 15)
# x[1:6,] = x[1:6,] + 1.5  # Heterogeneous dataset (first 6 samples have different mean)
# x[7:12,] = x[7:12,] + 3
# res = uclust3(data = x)
# res$groups
Hierarchical clustering method that partitions the data only when these partitions are statistically significant.
uhclust(md = NULL, data = NULL, alpha = 0.05, rep = 15, plot = TRUE)
md |
Matrix of distances between all data points. |
data |
Data matrix. Each row represents an observation. |
alpha |
Significance level. |
rep |
Number of times to repeat optimization procedures. Important for problems with multiple optima. |
plot |
Logical, |
This is the significance hierarchical clustering procedure of Valk and Cybis (2018). The data are repeatedly partitioned into two subgroups, through function uclust, according to a hierarchical scheme. The procedure stops when resulting subgroups are homogeneous or have fewer than 3 elements. This function should be used in high dimension small sample size settings.
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
Variance of bn is estimated through resampling, so p-values may vary slightly between runs.
For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." Journal of Computational and Graphical Statistics 30(1) (2021).
See also is_homo, uclust and utest_classify.
Returns an object of class hclust with three additional attribute arrays:
P-values from uclust for the final data partition at each node of the dendrogram. This array is in the same order as height, and only contains values for tests that were performed.
Bonferroni corrected significance levels for uclust for the data partitions at each node of the dendrogram. This array is in the same order as height, and only contains values for tests that were performed.
Final group assignments.
x = matrix(rnorm(100000), nrow = 50)  # creating homogeneous Gaussian dataset
res = uhclust(data = x)

x[1:30,] = x[1:30,] + 0.7  # Heterogeneous dataset
x[1:10,] = x[1:10,] + 0.4
res = uhclust(data = x)
res$groups
Test for the separation of two groups. The null hypothesis states that the groups are homogeneous and the alternative hypothesis states that they are separate.
utest(group_id, md = NULL, data = NULL, numB = 1000)
group_id |
A vector of 0s and 1s indicating to which group the samples belong. Must be in the same order as data or md. |
md |
Matrix of distances between all data points. |
data |
Data matrix. Each row represents an observation. |
numB |
Number of resampling iterations. |
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance, which is compatible with is_homo, uclust and uhclust.
For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018).
Returns a list with the following elements:
Test Statistic
Replication based p-value
Number of replications used to compute p-value
# Simulate a dataset with two separate groups, the first 5 rows have mean 0 and
# the last 5 rows have mean 5.
data <- matrix(c(rnorm(75, 0), rnorm(75, 5)), nrow = 10, byrow = TRUE)

# U test for mixed up groups
utest(group_id = c(1,0,1,0,1,0,1,0,1,0), data = data, numB = 3000)

# U test for correct group definitions
utest(group_id = c(1,1,1,1,1,0,0,0,0,0), data = data, numB = 3000)
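The idea of a replication based p-value can be illustrated with a generic label-permutation scheme in base R. This is a sketch under stated assumptions, not the resampling procedure used by utest: the helpers `sep_stat` and `perm_pvalue` are hypothetical, the statistic assumes the form n1 n2 / (n (n - 1)) * (2 U12 - U1 - U2) with groups of at least two elements, and the p-value is the usual permutation estimate (1 + number of permuted statistics at least as large as the observed one) / (numB + 1).

```r
# Hypothetical two-group separation statistic (assumed form, groups of size >= 2).
sep_stat <- function(group_id, md) {
  g1 <- which(group_id == 0)
  g2 <- which(group_id == 1)
  n1 <- length(g1)
  n2 <- length(g2)
  n <- n1 + n2
  within_mean <- function(idx) {
    m <- md[idx, idx]
    mean(m[lower.tri(m)])  # mean over distinct within-group pairs
  }
  (n1 * n2 / (n * (n - 1))) *
    (2 * mean(md[g1, g2]) - within_mean(g1) - within_mean(g2))
}

# Generic permutation p-value: shuffle the labels (group sizes are preserved)
# and count how often the permuted statistic reaches the observed one.
perm_pvalue <- function(group_id, md, numB = 999) {
  observed <- sep_stat(group_id, md)
  permuted <- replicate(numB, sep_stat(sample(group_id), md))
  (1 + sum(permuted >= observed)) / (numB + 1)
}

set.seed(2024)
# First 5 rows have mean 0, last 5 rows have mean 5 (15 dimensions)
data <- matrix(c(rnorm(75, 0), rnorm(75, 5)), nrow = 10, byrow = TRUE)
md <- as.matrix(dist(data))^2

p_correct <- perm_pvalue(c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0), md)  # labels match the groups
p_mixed   <- perm_pvalue(c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0), md)  # labels mixed up
```

With labels that match the simulated groups the permutation p-value should be small, while the mixed-up labelling should give a much larger one. Increasing numB reduces the Monte Carlo variability of the estimate at the cost of run time.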
The null hypothesis is that the new data is not well classified into the first group when compared to the second group. The alternative hypothesis is that the data is well classified into the first group.
utest_classify(x, data, group_id, bootstrap_iter = 1000)
x |
A numeric vector to be classified. |
data |
Data matrix. Each row represents an observation. |
group_id |
A vector of 0s (first group) and 1s indicating to which group the samples belong. Must be in the same order as data. |
bootstrap_iter |
Numeric scalar. The number of bootstraps. It's recommended |
The test is performed considering the squared Euclidean distance.
For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." arXiv preprint arXiv:1805.12179 (2018).
A list with class "utest_classify" containing the following components:
statistic |
The value of the test statistic. |
p_value |
The p-value for the test. |
bootstrap_iter |
The number of bootstrap iterations. |
# Example 1
# Five observations from each group, G1 and G2. Each observation has 60 dimensions.
data <- matrix(c(rnorm(300, 0), rnorm(300, 10)), ncol = 60, byrow = TRUE)
# Test data comes from G1.
x <- rnorm(60, 0)
# The test correctly indicates that the test data should be classified into G1 (p < 0.05).
utest_classify(x, data, group_id = c(rep(0, times = 5), rep(1, times = 5)))

# Example 2
# Five observations from each group, G1 and G2. Each observation has 60 dimensions.
data <- matrix(c(rnorm(300, 0), rnorm(300, 10)), ncol = 60, byrow = TRUE)
# Test data comes from G2.
x <- rnorm(60, 10)
# The test correctly indicates that the test data should be classified into G2 (p > 0.05).
utest_classify(x, data, group_id = c(rep(1, times = 5), rep(0, times = 5)))
Test for the separation of three groups. The null hypothesis states that the groups are homogeneous and the alternative hypothesis states that at least one is separated from the others.
utest3(group_id, md = NULL, data = NULL, alpha = 0.05, numB = 1000)
group_id |
A vector of 1s, 2s and 3s indicating to which group the samples belong. Must be in the same order as data or md. |
md |
Matrix of distances between all data points. |
data |
Data matrix. Each row represents an observation. |
alpha |
Significance level |
numB |
Number of resampling iterations. |
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
For more detail see Bello, Debora Zava, Marcio Valk and Gabriela Bettella Cybis. "Clustering inference in multiple groups." arXiv preprint arXiv:2106.09115 (2021).
Returns a list with the following elements:
Logical of whether test indicates that data is homogeneous
Replication based p-value
Test Statistic
Standard error for Bn statistic computed through resampling
# Simulate a dataset with three separate groups,
# the first row has mean -4, the next 5 rows have mean 0 and the last 5 rows have mean 4.
data <- matrix(c(rnorm(15, -4), rnorm(75, 0), rnorm(75, 4)), nrow = 11, byrow = TRUE)

# U test for mixed up groups
utest3(group_id = c(1,2,3,1,2,3,1,2,3,1,2), data = data, numB = 3000)

# U test for correct group definitions
utest3(group_id = c(1,2,2,2,2,2,3,3,3,3,3), data = data, numB = 3000)
Estimates the variance of the Bn statistic using the resampling procedure described in Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." Journal of Computational and Graphical Statistics 30(1) (2021).
var_bn(group_sizes, md = NULL, data = NULL, numB = 2000)
group_sizes |
A vector with two entries: size of group 1 and size of group 2. |
md |
Matrix of distances between all data points. |
data |
Data matrix. Each row represents an observation. |
numB |
Number of resampling iterations. Only used if no groups are of size 1. |
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance, which is compatible with is_homo, uclust and uhclust.
Variance of Bn
n = 5
x = matrix(rnorm(n*20), ncol = 20)

# option (a) entering the data matrix directly and considering a group of size 1
var_bn(c(1,4), data = x)

# option (b) entering the distance matrix and considering groups of size 2 and 3
md = as.matrix(dist(x))^2
var_bn(c(2,3), md)