Title: | Graph-Based Two-Sample Tests |
---|---|
Description: | Four graph-based tests are provided for testing whether two samples are from the same distribution. They work for both continuous and discrete data. |
Authors: | Hao Chen and Jingru Zhang |
Maintainer: | Hao Chen <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2 |
Built: | 2024-12-11 07:06:56 UTC |
Source: | CRAN |
This is a K by 2 matrix, where K is the number of distinct values. It specifies the counts in the K distinct values for the two samples. The data is generated from two samples with mean shift.
This is a K by 2 matrix, where K is the number of distinct values. It specifies the counts in the K distinct values for the two samples. The data is generated from two samples with spread difference.
This is a K by 2 matrix, where K is the number of distinct values. It specifies the counts in the K distinct values for the two samples. The data is generated from two samples with mean shift and spread difference.
Depth-first search: starting at the root, the search explores as far as possible along each branch before backtracking.
dfs(s,visited,adj)
s |
The root node. |
visited |
N by 1 vector, where N is the number of nodes. This vector records whether nodes have been visited or not with 1 if visited and 0 otherwise. |
adj |
N by N adjacency matrix. |
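As a rough illustration of the traversal described above, here is a minimal sketch in R. The name `dfs_sketch` is hypothetical, and returning the updated `visited` vector is an assumption; the package's own `dfs` may behave differently.

```r
# Recursive depth-first search on an adjacency matrix (illustrative sketch).
# s: root node index; visited: 0/1 vector of length N; adj: N x N 0/1 matrix.
dfs_sketch <- function(s, visited, adj) {
  visited[s] <- 1
  for (v in which(adj[s, ] != 0)) {
    if (visited[v] == 0) {
      # Go as deep as possible from v before trying the next neighbor.
      visited <- dfs_sketch(v, visited, adj)
    }
  }
  visited
}

# Toy graph: edges 1-2 and 2-3, plus a separate pair 4-5.
adj <- matrix(0, 5, 5)
adj[cbind(c(1, 2, 4), c(2, 3, 5))] <- 1
adj <- adj + t(adj)

# Nodes reachable from node 1 are marked with 1.
reached <- dfs_sketch(1, rep(0, 5), adj)
```

Starting from node 1, only nodes 1, 2 and 3 are marked as visited, because nodes 4 and 5 lie in a different component.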
This is a K by K matrix, which is the distance matrix on the distinct values for counts1.
This is a K by K matrix, which is the distance matrix on the distinct values for counts2.
This is a K by K matrix, which is the distance matrix on the distinct values for counts3.
This is an edge matrix with one row per edge of the similarity graph and 2 columns. Each row records the subject indices of the two ends of an edge in the similarity graph. The subject indices of sample 1 are 1:100, and the subject indices of sample 2 are 101:250.
This is an edge matrix with one row per edge of the similarity graph and 2 columns. Each row records the subject indices of the two ends of an edge in the similarity graph. The subject indices of sample 1 are 1:100, and the subject indices of sample 2 are 101:250.
This is an edge matrix with one row per edge of the similarity graph and 2 columns. Each row records the subject indices of the two ends of an edge in the similarity graph. The subject indices of sample 1 are 1:100, and the subject indices of sample 2 are 101:250.
This function provides four graph-based two-sample tests.
g.tests(E, sample1ID, sample2ID, test.type="all", maxtype.kappa = 1.14, perm=0)
E |
An edge matrix representing a similarity graph with the number of edges in the similarity graph being the number of rows and 2 columns. Each row records the subject indices of the two ends of an edge in the similarity graph. |
sample1ID |
The subject indices of sample 1. |
sample2ID |
The subject indices of sample 2. |
test.type |
The default value is "all", which means all four tests are performed: the original edge-count test (Friedman and Rafsky (1979)), the generalized edge-count test (Chen and Friedman (2016)), the weighted edge-count test (Chen, Chen and Su (2017)) and the maxtype edge-count tests (Zhang and Chen (2017)). Set this value to "original" or "o" to perform only the original edge-count test; set this value to "generalized" or "g" to perform only the generalized edge-count test; set this value to "weighted" or "w" to perform only the weighted edge-count test; and set this value to "maxtype" or "m" to perform only the maxtype edge-count tests. |
maxtype.kappa |
The value of the parameter kappa in the maxtype edge-count tests. The default value is 1.14. |
perm |
The number of permutations used to calculate the permutation p-value of the test. The default value is 0, which means no permutation is performed and only the approximate p-value based on asymptotic theory is provided. Permutation can be time consuming, so be cautious when setting this value above 10,000. |
test.statistic |
The test statistic. |
pval.approx |
The approximated p-value based on asymptotic theory. |
pval.perm |
The permutation p-value when argument 'perm' is positive. |
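To make the role of `perm` concrete, the following is a minimal sketch of a permutation p-value for the original edge-count idea only: count the edges that join the two samples, then compare that count with the counts obtained after randomly reassigning sample labels. The helper names `between_edges` and `perm_pvalue` are hypothetical, and the sketch uses the raw count rather than the test statistic reported by `g.tests`, so its numbers should not be expected to match the package output.

```r
# Number of edges joining one observation from each sample.
between_edges <- function(E, sample1ID) {
  in1 <- matrix(E %in% sample1ID, ncol = 2)
  sum(in1[, 1] != in1[, 2])
}

# Proportion of label permutations with at most as many between-sample
# edges as observed (few between-sample edges suggest a difference).
perm_pvalue <- function(E, sample1ID, sample2ID, perm = 200) {
  ids <- c(sample1ID, sample2ID)
  n1 <- length(sample1ID)
  observed <- between_edges(E, sample1ID)
  permuted <- replicate(perm, between_edges(E, sample(ids, n1)))
  mean(permuted <= observed)
}

set.seed(1)
# Toy graph: a path through 1..5, a path through 6..10, joined by edge 5-6.
E_toy <- rbind(cbind(1:4, 2:5), cbind(6:9, 7:10), c(5, 6))
r0 <- between_edges(E_toy, 1:5)
p <- perm_pvalue(E_toy, 1:5, 6:10, perm = 500)
```

In this toy graph only one edge crosses between the samples, so very few random relabelings give a count that small and the resulting proportion is close to zero.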
Friedman J. and Rafsky L. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697-717, 1979.
Chen, H. and Friedman, J. H. A new graph-based two-sample test for multivariate and object data. Journal of the American Statistical Association, 2016.
Chen, H., Chen, X. and Su, Y. A weighted edge-count two sample test for multivariate and object data. Journal of the American Statistical Association, 2017.
Zhang, J. and Chen, H. Graph-based two-sample tests for discrete data.
# The "example" data contains three similarity graphs represented in matrix form: E1, E2, E3.
data(example)

# E1 is an edge matrix representing a similarity graph.
# It is constructed on two samples with mean difference.
# Sample 1 indices: 1:100; sample 2 indices: 101:250.
g.tests(E1, 1:100, 101:250)

# E2 is an edge matrix representing a similarity graph.
# It is constructed on two samples with variance difference.
# Sample 1 indices: 1:100; sample 2 indices: 101:250.
g.tests(E2, 1:100, 101:250)

# E3 is an edge matrix representing a similarity graph.
# It is constructed on two samples with mean and variance difference.
# Sample 1 indices: 1:100; sample 2 indices: 101:250.
g.tests(E3, 1:100, 101:250)

## Uncomment the following line to get permutation p-values with 200 permutations.
# g.tests(E1, 1:100, 101:250, perm=200)
This function provides four graph-based two-sample tests for discrete data.
g.tests_discrete(E, counts, test.type = "all", maxtype.kappa = 1.14, perm = 0)
E |
An edge matrix representing a similarity graph on the distinct values with the number of edges in the similarity graph being the number of rows and 2 columns. Each row records the subject indices of the two ends of an edge in the similarity graph. |
counts |
A K by 2 matrix, where K is the number of distinct values. It specifies the counts in the K distinct values for the two samples. |
test.type |
The default value is "all", which means all four tests are performed: the original edge-count test (Chen and Zhang (2013)), the extension of the generalized edge-count test (Chen and Friedman (2016)), the extension of the weighted edge-count test (Chen, Chen and Su (2017)) and the extension of the maxtype edge-count tests (Zhang and Chen (2017)). Set this value to "original" or "o" to perform only the original edge-count test; set this value to "generalized" or "g" to perform only the extension of the generalized edge-count test; set this value to "weighted" or "w" to perform only the extension of the weighted edge-count test; and set this value to "maxtype" or "m" to perform only the extension of the maxtype edge-count tests. |
maxtype.kappa |
The value of the parameter kappa in the extension of the maxtype edge-count tests. The default value is 1.14. |
perm |
The number of permutations used to calculate the permutation p-value of the test. The default value is 0, which means no permutation is performed and only the approximate p-value based on asymptotic theory is provided. Permutation can be time consuming, so be cautious when setting this value above 10,000. |
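The `counts` argument and the distance matrix on the distinct values can be prepared from raw observations in base R. The sketch below shows one way to do so on toy data; the variable names are illustrative, and the Euclidean distance on the values is an assumption chosen for the example rather than a requirement stated by the package.

```r
# Two samples of discrete observations (toy data).
x <- c(1, 1, 2, 3, 3, 3)
y <- c(2, 3, 3, 4, 4)

# The K distinct values observed in either sample.
values <- sort(unique(c(x, y)))

# K x 2 matrix of counts: column 1 for sample 1, column 2 for sample 2.
counts <- cbind(
  as.vector(table(factor(x, levels = values))),
  as.vector(table(factor(y, levels = values)))
)

# K x K distance matrix on the distinct values (Euclidean, an assumption).
ds <- as.matrix(dist(values))
```

Here K is 4, the first column of `counts` sums to the size of sample 1 and the second to the size of sample 2.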
test.statistic_a |
The test statistic using the 'average' method to construct the graph. |
test.statistic_u |
The test statistic using the 'union' method to construct the graph. |
pval.approx_a |
The approximate p-value based on asymptotic theory, using the 'average' method to construct the graph. |
pval.approx_u |
The approximate p-value based on asymptotic theory, using the 'union' method to construct the graph. |
pval.perm_a |
The permutation p-value when argument 'perm' is positive, using the 'average' method to construct the graph. |
pval.perm_u |
The permutation p-value when argument 'perm' is positive, using the 'union' method to construct the graph. |
Friedman J. and Rafsky L. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697-717, 1979.
Chen, H. and Zhang, N. R. Graph-based tests for two-sample comparisons of categorical data. Statistica Sinica, 2013.
Chen, H. and Friedman, J. H. A new graph-based two-sample test for multivariate and object data. Journal of the American Statistical Association, 2016.
Chen, H., Chen, X. and Su, Y. A weighted edge-count two sample test for multivariate and object data. Journal of the American Statistical Association, 2017.
Zhang, J. and Chen, H. Graph-based two-sample tests for discrete data.
# The "example_discrete" data contains three two-sample counts data sets
# represented in matrix form: counts1, counts2, counts3,
# and the corresponding distance matrices on the distinct values: ds1, ds2, ds3.
data(example_discrete)

# counts1 is a K by 2 matrix, where K is the number of distinct values.
# It specifies the counts in the K distinct values for the two samples.
# ds1 is the corresponding distance matrix on the distinct values.
# The data is generated from two samples with mean shift.
Knnl = 3
E1 = getGraph(counts1, ds1, Knnl, graph.type = "nnlink")
g.tests_discrete(E1, counts1)

# counts2 is a K by 2 matrix, where K is the number of distinct values.
# It specifies the counts in the K distinct values for the two samples.
# ds2 is the corresponding distance matrix on the distinct values.
# The data is generated from two samples with spread difference.
Kmst = 6
E2 = getGraph(counts2, ds2, Kmst, graph.type = "mstree")
g.tests_discrete(E2, counts2)

# counts3 is a K by 2 matrix, where K is the number of distinct values.
# It specifies the counts in the K distinct values for the two samples.
# ds3 is the corresponding distance matrix on the distinct values.
# The data is generated from two samples with mean shift and spread difference.
Knnl = 3
E3 = getGraph(counts3, ds3, Knnl, graph.type = "nnlink")
g.tests_discrete(E3, counts3)

## Uncomment the following lines to get permutation p-values with 300 permutations.
# Knnl = 3
# E1 = getGraph(counts1, ds1, Knnl, graph.type = "nnlink")
# g.tests_discrete(E1, counts1, test.type = "all", maxtype.kappa = 1.31, perm = 300)
This function calculates the distance between two components.
getComdist(g1,g2,distance)
g1 |
The distinct values in Component 1. |
g2 |
The distinct values in Component 2. |
distance |
A K by K distance matrix on the distinct values, where K is the number of distinct values with at least one observation in either group. |
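The description above does not say how the distance between two components is defined. One common choice is single linkage, the smallest distance between any value in one component and any value in the other. The sketch below shows that choice only as an assumption; the name `component_distance` is hypothetical and the package's `getComdist` may use a different definition.

```r
# Single-linkage distance between two components (assumed definition).
# g1, g2: indices of the distinct values in each component.
# distance: K x K distance matrix on the distinct values.
component_distance <- function(g1, g2, distance) {
  min(distance[g1, g2, drop = FALSE])
}

# Toy example: four distinct values on a line.
distance <- as.matrix(dist(c(0, 1, 5, 7)))
d12 <- component_distance(c(1, 2), c(3, 4), distance)
```

With values 0 and 1 in the first component and 5 and 7 in the second, the closest pair is 1 and 5.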
This function provides two methods to construct the similarity graph.
getGraph(counts, mydist, K, graph.type = "mstree")
counts |
A K by 2 matrix, where K is the number of distinct values. It specifies the counts in the K distinct values for the two samples. |
mydist |
A K by K matrix, which is the distance matrix on the distinct values. |
K |
The value of k in the "k-MST" or "k-NNL" used to construct the similarity graph. |
graph.type |
The type of similarity graph to construct. The default value is "mstree", which constructs the minimum spanning tree as the similarity graph. Set this value to "nnlink" to construct the similarity graph by the nearest neighbor link method. |
E |
An edge matrix representing a similarity graph on the distinct values with the number of edges in the similarity graph being the number of rows and 2 columns. Each row records the subject indices of the two ends of an edge in the similarity graph. |
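For readers unfamiliar with the edge-matrix representation, here is a minimal sketch that builds a single minimum spanning tree from a distance matrix using Prim's algorithm and returns it as a two-column edge matrix. The name `prim_mst_edges` is hypothetical. The sketch ignores the `K` argument and does nothing special for tied distances, both of which `getGraph` may handle differently.

```r
# Prim's algorithm: grow a tree from node 1, always adding the closest
# node that is not yet in the tree. Returns an (n - 1) x 2 edge matrix.
prim_mst_edges <- function(d) {
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  best_dist <- d[1, ]          # shortest known distance from the tree
  best_from <- rep(1L, n)      # tree node achieving that distance
  edges <- matrix(0L, nrow = n - 1, ncol = 2)
  for (step in seq_len(n - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])]
    edges[step, ] <- c(best_from[j], j)
    in_tree[j] <- TRUE
    # Update distances for nodes that are closer to j than to the old tree.
    closer <- !in_tree & d[j, ] < best_dist
    best_dist[closer] <- d[j, closer]
    best_from[closer] <- j
  }
  edges
}

# Toy example: four distinct values on a line.
d <- as.matrix(dist(c(1, 2, 4, 8)))
E_mst <- prim_mst_edges(d)
total_length <- sum(d[E_mst])
```

Each row of `E_mst` holds the indices of the two ends of one tree edge, which is the same layout as the edge matrices described on this page.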
This function calculates the means and variances of the R1 and R2 quantities, using the 'average' method and the 'union' method to construct the graph.
getMV_discrete(E,vmat)
E |
An edge matrix representing a similarity graph on the distinct values with the number of edges in the similarity graph being the number of rows and 2 columns. Each row records the subject indices of the two ends of an edge in the similarity graph. |
vmat |
A K by 2 matrix, where K is the number of distinct values with at least one observation in either group. It specifies the counts in the K distinct values for the two samples. |
This function calculates R1 and R2 quantities.
getR1R2(E, G1)
E |
An edge matrix with one row per edge of the similarity graph and 2 columns. Each row records the subject indices of the two ends of an edge in the similarity graph. |
G1 |
The subject indices of sample 1. |
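In the edge-count literature cited by this package, R1 and R2 denote the number of edges connecting two observations from sample 1 and the number connecting two observations from sample 2. A minimal sketch of that count follows. The name `count_within_edges` and the returned list are assumptions for illustration, not the package's `getR1R2` interface.

```r
# Count within-sample edges from an edge matrix.
# E: edge matrix (one row per edge); G1: subject indices of sample 1.
count_within_edges <- function(E, G1) {
  in1 <- matrix(E %in% G1, ncol = 2)   # TRUE where an endpoint is in sample 1
  list(
    R1 = sum(in1[, 1] & in1[, 2]),     # both ends in sample 1
    R2 = sum(!in1[, 1] & !in1[, 2])    # both ends in sample 2
  )
}

# Toy graph: a path 1-2-3-4-5-6 with sample 1 = {1, 2, 3}.
E_toy <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5), c(5, 6))
res <- count_within_edges(E_toy, 1:3)
```

Edges 1-2 and 2-3 lie within sample 1, edges 4-5 and 5-6 lie within sample 2, and edge 3-4 joins the two samples and is counted in neither quantity.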
This function calculates the R1 and R2 quantities, using the 'average' method and the 'union' method to construct the graph.
getR1R2_discrete(E,vmat)
E |
An edge matrix representing a similarity graph on the distinct values with the number of edges in the similarity graph being the number of rows and 2 columns. Each row records the subject indices of the two ends of an edge in the similarity graph. |
vmat |
A K by 2 matrix, where K is the number of distinct values with at least one observation in either group. It specifies the counts in the K distinct values for the two samples. |
This package includes four graph-based two-sample tests under the continuous setting and the discrete setting.
Hao Chen and Jingru Zhang
Maintainer: Hao Chen ([email protected])
Friedman J. and Rafsky L. (1979). Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics 7(4):697-717.
Chen, H. and Zhang, N. R. (2013). Graph-based tests for two-sample comparisons of categorical data. Statistica Sinica 23:1479-1503.
Chen, H. and Friedman, J. H. (2017). A new graph-based two-sample test for multivariate and object data. Journal of the American Statistical Association 112(517):397-409.
Chen, H., Chen, X. and Su, Y. (2017). A weighted edge-count two sample test for multivariate and object data. Journal of the American Statistical Association.
Zhang, J. and Chen, H. (2017). Graph-based two-sample tests for discrete data. arXiv:1711.04349
g.tests
g.tests_discrete
getGraph
This function provides the edges of the similarity graph constructed by 1-NNL.
nnlink(distance)
distance |
A K by K distance matrix on the distinct values, where K is the number of distinct values with at least one observation in either group. |
E |
An edge matrix representing a similarity graph on the distinct values with the number of edges in the similarity graph being the number of rows and 2 columns. Each row records the subject indices of the two ends of an edge in the similarity graph. |
This function obtains components based on the nearest neighbor link algorithm.
nnlink_Com(distance)
distance |
A K by K distance matrix on the distinct values, where K is the number of distinct values with at least one observation in either group. |
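The description does not spell out how the components are formed. One plausible reading is that every distinct value is linked to its nearest neighbor or neighbors, and the connected components of the resulting graph are then collected. The sketch below follows that reading as an assumption; `nn_components` is a hypothetical name and the package's `nnlink_Com` may differ in both procedure and return value.

```r
# Connected components of the nearest-neighbor graph (assumed procedure).
# distance: K x K distance matrix. Returns a component label per value.
nn_components <- function(distance) {
  K <- nrow(distance)
  diag(distance) <- Inf                       # ignore self-distances
  # adj[i, j] is TRUE if j is a nearest neighbor of i (ties are kept).
  adj <- distance == apply(distance, 1, min)
  adj <- adj | t(adj)                         # treat links as undirected
  comp <- integer(K)
  label <- 0L
  for (start in seq_len(K)) {
    if (comp[start] != 0L) next
    label <- label + 1L
    stack <- start
    while (length(stack) > 0) {               # iterative depth-first search
      node <- stack[1]
      stack <- stack[-1]
      if (comp[node] != 0L) next
      comp[node] <- label
      stack <- c(which(adj[node, ] & comp == 0L), stack)
    }
  }
  comp
}

# Toy example: two well-separated groups of values on a line.
distance <- as.matrix(dist(c(0, 1, 10, 11, 12)))
comp <- nn_components(distance)
```

The first two values form one component and the last three form another, since no nearest-neighbor link crosses the gap between 1 and 10.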
This function provides the edges of the similarity graph constructed by k-NNL.
nnlink_K(distance,K)
distance |
A K by K distance matrix on the distinct values, where K is the number of distinct values with at least one observation in either group. |
K |
The value of k in the "k-NNL" used to construct the similarity graph. |
E |
An edge matrix representing a similarity graph on the distinct values with the number of edges in the similarity graph being the number of rows and 2 columns. Each row records the subject indices of the two ends of an edge in the similarity graph. |
This function permutes the observations while keeping the two sample sizes unchanged.
permute_discrete(vmat)
vmat |
A K by 2 matrix, where K is the number of distinct values with at least one observation in either group. It specifies the counts in the K distinct values for the two samples. |
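A permutation of this kind can be sketched by expanding the count matrix into one entry per observation, randomly choosing which observations are assigned to sample 1, and tabulating the result again. The name `permute_counts` is hypothetical and the code is an illustration of the idea, not the package's `permute_discrete`.

```r
# Randomly reassign observations to the two samples, keeping the sample
# sizes and the total count of each distinct value fixed.
permute_counts <- function(vmat) {
  K <- nrow(vmat)
  n1 <- sum(vmat[, 1])
  totals <- rowSums(vmat)
  # One entry per observation, recording which distinct value it takes.
  values <- rep(seq_len(K), times = totals)
  # Choose which n1 observations are relabeled as sample 1.
  to_sample1 <- sample.int(length(values), n1)
  new1 <- tabulate(values[to_sample1], nbins = K)
  unname(cbind(new1, totals - new1))
}

set.seed(1)
vmat <- cbind(c(2, 1, 3, 0), c(0, 1, 2, 2))
vperm <- permute_counts(vmat)
```

The permuted matrix has the same row sums and the same column sums as `vmat`; only the split of each distinct value between the two samples changes.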