| Title: | Another Test of Association for Count Data |
|---|---|
| Description: | The Upsilon test assesses association among categorical variables against the null hypothesis of independence (Luo 2021 MS thesis; ProQuest Publication No. 28649813). While promoting dominant function patterns, it demotes non-dominant function patterns. It is robust to low expected counts, so a continuity correction such as Yates's seems unnecessary. Because it uses a common null population following a uniform distribution, contingency tables are comparable by statistical significance; this is not the case for most association tests, which define a varying null population by the tensor product of observed marginals. Although Pearson's chi-squared test, Fisher's exact test, and Woolf's G-test (related to mutual information) are useful in some contexts, the Upsilon test appeals to ranking association patterns that do not necessarily follow the same marginal distributions, such as in count data from DNA and RNA sequencing, a rapidly expanding frontier in modern science. |
| Authors: | Xuye Luo [aut], Joe Song [aut, cre] (ORCID: <https://orcid.org/0000-0002-6883-6547>) |
| Maintainer: | Joe Song <[email protected]> |
| License: | LGPL (>= 3) |
| Version: | 0.1.1 |
| Built: | 2026-06-05 06:59:05 UTC |
| Source: | https://github.com/cran/Upsilon |
Performs a fast zero-tolerant Pearson's chi-squared test (Pearson 1900) to evaluate association between observations from two categorical variables.
fast.chisq.test(x, y, log.p = FALSE)
x |
a vector to specify observations of the first categorical variable. The vector can be of numeric, character, or logical type. |
y |
a vector to specify observations of the second categorical variable. Must not contain |
log.p |
a logical. If |
A list with class "htest" containing the following components:
statistic |
the value of chi-squared test statistic. |
parameter |
the degrees of freedom. |
p.value |
the p-value of the test. |
estimate |
Cramér's V statistic representing the effect size. |
method |
a character string indicating the method used. |
data.name |
a character string giving the names of input data. |
The test uses an internal hash table, instead of a matrix, to store the contingency table. Savings in both runtime and memory can be substantial if the contingency table is sparse and large. The test is implemented in C++, which gives an additional layer of speedup over an R implementation.
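To illustrate why sparse storage helps, the sketch below counts only the observed (x, y) pairs in hashed R environments and evaluates Pearson's statistic from those cells alone. It relies on the identity Σ(O − E)²/E = ΣO²/E − n, so cells with a zero observed count never need to be stored. This is a base-R illustration under stated assumptions, not the package's C++ implementation: the function name `sparse_chisq` and the tab-separated key are hypothetical, and the labels are assumed to be non-empty, non-missing, and free of tab characters.

```r
# Hypothetical base-R sketch of a hash-based Pearson chi-squared statistic.
sparse_chisq <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 0)
  n  <- length(x)
  xs <- as.character(x)
  ys <- as.character(y)

  cells <- new.env(hash = TRUE)  # "x<TAB>y" -> observed count
  rows  <- new.env(hash = TRUE)  # x label   -> row total
  cols  <- new.env(hash = TRUE)  # y label   -> column total

  # Increment the counter stored under `key`, creating it if needed.
  bump <- function(env, key) {
    old <- env[[key]]
    env[[key]] <- if (is.null(old)) 1 else old + 1
  }

  for (i in seq_len(n)) {
    bump(cells, paste(xs[i], ys[i], sep = "\t"))
    bump(rows, xs[i])
    bump(cols, ys[i])
  }

  # Sum O^2 / E over the non-zero cells only, then subtract n.
  stat <- -n
  for (key in ls(cells, all.names = TRUE)) {
    parts <- strsplit(key, "\t", fixed = TRUE)[[1]]
    o <- cells[[key]]
    e <- rows[[parts[1]]] * cols[[parts[2]]] / n
    stat <- stat + o^2 / e
  }

  df <- (length(ls(rows, all.names = TRUE)) - 1) *
        (length(ls(cols, all.names = TRUE)) - 1)
  list(statistic = stat, parameter = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

weather <- c("rainy", "sunny", "rainy", "sunny", "rainy")
mood    <- c("wistful", "upbeat", "upbeat", "upbeat", "wistful")
sparse_chisq(weather, mood)
```

On these vectors the statistic should agree with `chisq.test(table(weather, mood), correct = FALSE)`, because every row and column built from observed labels has a positive total.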
Pearson K (1900). “X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302), 157–175. doi:10.1080/14786440009463897.
library("Upsilon")

weather <- c("rainy", "sunny", "rainy", "sunny", "rainy")
mood <- c("wistful", "upbeat", "upbeat", "upbeat", "wistful")

fast.chisq.test(weather, mood)

# The result is equivalent to:
modified.chisq.test(table(weather, mood))
Performs a fast zero-tolerant G-test (Woolf 1957) to evaluate association between observations from two categorical variables.
fast.gtest(x, y, log.p = FALSE)
x |
a vector to specify observations of the first categorical variable. The vector can be of numeric, character, or logical type. |
y |
a vector to specify observations of the second categorical variable. Must not contain |
log.p |
a logical. If |
A list with class "htest" containing the following components:
statistic |
the G-test statistic (Likelihood Ratio Chi-squared statistic). |
parameter |
the degrees of freedom. |
p.value |
the p-value of the test. |
estimate |
the mutual information between the two variables. |
method |
a character string indicating the method used. |
data.name |
a character string giving the names of the data. |
The test uses an internal hash table, instead of a matrix, to store the contingency table. Savings in both runtime and memory can be substantial if the contingency table is sparse and large. The test is implemented in C++, which gives an additional layer of speedup over an R implementation.
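The link between the G statistic and mutual information reported in the return value can be sketched in base R. The example assumes mutual information is measured in nats (natural logarithm), in which case G = 2 · n · MI; whether the package reports `estimate` in nats or in bits is not stated here. The function name `g_and_mi` is hypothetical.

```r
# Hypothetical base-R sketch: G statistic and mutual information (in nats).
g_and_mi <- function(x, y) {
  tab <- table(x, y)
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n

  # Cells with a zero observed count contribute nothing to G.
  nz <- tab > 0
  g <- 2 * sum(tab[nz] * log(tab[nz] / expected[nz]))

  list(G = g,
       mutual.information = g / (2 * n),
       df = (nrow(tab) - 1) * (ncol(tab) - 1))
}

weather <- c("rainy", "sunny", "rainy", "sunny", "rainy")
mood    <- c("wistful", "upbeat", "upbeat", "upbeat", "wistful")
g_and_mi(weather, mood)
```

The mutual information returned by the sketch equals H(X) + H(Y) − H(X, Y) computed from the empirical distribution, which is one way to check the result independently.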
Woolf B (1957). “The log likelihood ratio test (the G-test); methods and tables for tests of heterogeneity in contingency tables.” Annals of Human Genetics, 21(4), 397–409. doi:10.1111/j.1469-1809.1972.tb00293.x.
library("Upsilon")

weather <- c("rainy", "sunny", "rainy", "sunny", "rainy")
mood <- c("wistful", "upbeat", "upbeat", "upbeat", "wistful")

fast.gtest(weather, mood)

# The result is equivalent to:
modified.gtest(table(weather, mood))
Performs a fast Upsilon test (Luo 2021) to evaluate association between observations from two categorical variables.
fast.upsilon.test(x, y, log.p = FALSE)
x |
a vector to specify observations of the first categorical variable. The vector can be of numeric, character, or logical type. |
y |
a vector to specify observations of the second categorical variable. Must not contain |
log.p |
a logical. If |
The Upsilon test is designed to promote dominant function patterns. In contrast to other tests of association, which favor all function patterns, it is unique in demoting non-dominant function patterns.
Null hypothesis ($H_0$): Row and column variables are statistically independent.
Null population: A discrete uniform distribution, where each entry in the table has the same probability.
Null distribution: The Upsilon test statistic asymptotically follows a chi-squared distribution with (r - 1)(c - 1) degrees of freedom under the null hypothesis on the null population, where r and c are the numbers of rows and columns of the contingency table formed from x and y.
See Luo (2021) for full details of the Upsilon test.
A list with class "htest" containing the following components:
statistic |
the Upsilon test statistic. |
parameter |
the degrees of freedom. |
p.value |
the p-value of the test. |
estimate |
the effect size derived from the Upsilon statistic. |
method |
a character string indicating the method used. |
data.name |
a character string giving the name of input data. |
The test uses an internal hash table, instead of a matrix, to store the contingency table. Savings in both runtime and memory can be substantial if the contingency table is sparse and large. The test is implemented in C++, which gives an additional layer of speedup over an R implementation.
Luo X (2021). The Upsilon Test for Association Between Categorical Variables. Master's thesis, Department of Computer Science, New Mexico State University, Las Cruces, NM, United States. Publication No. 28649813. ProQuest Dissertations and Theses Global.
library("Upsilon")

weather <- c("rainy", "sunny", "rainy", "sunny", "rainy")
mood <- c("wistful", "upbeat", "upbeat", "upbeat", "wistful")

fast.upsilon.test(weather, mood)

# The result is equivalent to:
upsilon.test(table(weather, mood))
Performs Pearson's chi-squared test (Pearson 1900) on contingency tables, slightly modified to handle rows or columns of all zeros.
modified.chisq.test(x, log.p = FALSE)
x |
a matrix or data frame of floating-point or integer numbers to specify a contingency table. Entries must be non-negative. |
log.p |
a logical. If |
This test is useful if a p-value must be returned on a contingency table with valid non-negative counts, where the built-in R implementation of chisq.test could return NA as the p-value, regardless of whether a pattern is strong or weak. See Examples.
Unlike chisq.test, this function handles tables with empty rows or columns (where expected values are 0) by calculating the test statistic over the entries with non-zero expected counts only. This prevents the result from becoming NA, while giving meaningful p-values.
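The zero-tolerant calculation described above can be sketched in base R. The sketch covers only the test statistic; the package's conventions for degrees of freedom, p-value, and Cramér's V are not reproduced, and the function name `zero_tolerant_chisq_stat` is hypothetical.

```r
# Hypothetical base-R sketch: Pearson's statistic over cells whose
# expected count is positive, skipping empty rows and columns.
zero_tolerant_chisq_stat <- function(x) {
  x <- as.matrix(x)
  stopifnot(all(x >= 0), sum(x) > 0)
  n <- sum(x)
  expected <- outer(rowSums(x), colSums(x)) / n

  keep <- expected > 0  # drops cells in all-zero rows or columns
  sum((x[keep] - expected[keep])^2 / expected[keep])
}

# A table with an empty third column
x <- matrix(c(0, 3, 0,
              3, 0, 0), nrow = 2, byrow = TRUE)

zero_tolerant_chisq_stat(x)

# Base R for comparison: the 0/0 terms make the statistic non-finite,
# so no usable p-value is returned.
suppressWarnings(chisq.test(x))
```

Restricting the sum to cells with positive expected counts removes the 0/0 terms that arise from the empty column while leaving every other term unchanged.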
A list with class "htest" containing:
statistic |
the chi-squared test statistic (calculated ignoring entries of 0-expected count). |
parameter |
the degrees of freedom. |
p.value |
the p-value of the test. |
estimate |
Cramér's V statistic. |
observed |
the observed counts. |
expected |
the expected counts under the null hypothesis. |
This function only takes a contingency table as input. It does not support goodness-of-fit tests on vectors.
It does not offer an option to apply Yates's continuity correction on 2 × 2 tables.
Pearson K (1900). “X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302), 157–175. doi:10.1080/14786440009463897.
library("Upsilon")

# A table with a dominant function and an empty column
x <- matrix(
  c(0, 3, 0,
    3, 0, 0),
  nrow = 2, byrow = TRUE)
print(x)

# Standard chisq.test fails or returns NA warning
chisq.test(x)

# Modified chi-squared test is significant:
modified.chisq.test(x)
Performs G-test (Woolf 1957) on contingency tables, slightly modified to handle rows or columns of all zeros.
modified.gtest(x, log.p = FALSE)
x |
a matrix or data frame of floating-point or integer numbers to specify a contingency table. Entries must be non-negative. |
log.p |
a logical. If |
This test is useful if a p-value must be returned on a contingency table with valid non-negative counts, where other implementations of G-test could return NA as the p-value, regardless of a pattern being strong or weak.
This function handles tables with empty rows or columns (where expected values are 0) by calculating the test statistic over non-zero entries only. This prevents the result from becoming NA, while giving meaningful p-values.
A list with class "htest" containing:
statistic |
the G statistic (log-likelihood ratio). |
parameter |
the degrees of freedom. |
p.value |
the p-value of the test. |
estimate |
the value of mutual information. |
method |
a character string indicating the method used. |
data.name |
a character string, name of the input data. |
observed |
the observed counts. |
expected |
the expected counts under the null hypothesis. |
Woolf B (1957). “The log likelihood ratio test (the G-test); methods and tables for tests of heterogeneity in contingency tables.” Annals of Human Genetics, 21(4), 397–409. doi:10.1111/j.1469-1809.1972.tb00293.x.
library("Upsilon")

# Create a sparse table with empty rows/cols
x <- matrix(
  c(0, 3, 0,
    3, 0, 0),
  nrow = 2, byrow = TRUE
)
print(x)

# Perform the modified G-test
modified.gtest(x)
Performs the Upsilon test to evaluate association among categorical variables represented by a contingency table.
upsilon.test(x, log.p = FALSE)
x |
a matrix or data frame of floating-point or integer numbers to specify a contingency table. Entries must be non-negative. |
log.p |
a logical. If |
The Upsilon test is designed to promote dominant function patterns. In contrast to other tests of association, which favor all function patterns, it is unique in demoting non-dominant function patterns.
Null hypothesis ($H_0$): Row and column variables are statistically independent.
Null population: A discrete uniform distribution, where each entry in the table has the same probability.
Null distribution: The Upsilon test statistic asymptotically follows a chi-squared distribution with (nrow(x) - 1)(ncol(x) - 1) degrees of freedom, under the null hypothesis on the null population.
See Luo (2021) for full details of the Upsilon test.
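A minimal base-R sketch of how a uniform null population can demote non-dominant patterns follows. It assumes the Upsilon statistic sums the squared deviations of the observed counts from the marginal-based expected counts and divides by the uniform expected count n/(rc). This form is an assumption inferred from the description above and is not spelled out in this manual, so `upsilon_stat_sketch` is a hypothetical illustration; Luo (2021) gives the authoritative definition.

```r
# Hypothetical sketch; the formula is an assumption, not taken from the package.
upsilon_stat_sketch <- function(x) {
  x <- as.matrix(x)
  stopifnot(all(x >= 0), sum(x) > 0)
  n <- sum(x)
  expected <- outer(rowSums(x), colSums(x)) / n  # from observed marginals
  uniform  <- n / (nrow(x) * ncol(x))            # uniform null population
  sum((x - expected)^2) / uniform
}

# Pearson's statistic for comparison (tables without empty rows or columns)
pearson_stat <- function(x) {
  expected <- outer(rowSums(x), colSums(x)) / sum(x)
  sum((x - expected)^2 / expected)
}

dominant     <- diag(c(2, 2, 2))  # balanced one-to-one pattern
non_dominant <- diag(c(4, 1, 1))  # same pattern, skewed marginals

c(pearson = pearson_stat(dominant),     sketch = upsilon_stat_sketch(dominant))
c(pearson = pearson_stat(non_dominant), sketch = upsilon_stat_sketch(non_dominant))
```

Under this assumed form, Pearson's statistic is 12 for both tables, while the sketch gives 12 for the dominant table and 7.5 for the non-dominant one, which is consistent with the demotion behavior described above.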
A list with class "htest" containing:
statistic |
the value of the Upsilon statistic. |
parameter |
the degrees of freedom. |
p.value |
the p-value. |
estimate |
the effect size. |
method |
a character string giving the test name. |
data.name |
a character string giving the name of input data. |
observed |
the observed counts, a matrix copy of the input data. |
expected |
the expected counts under the null hypothesis using the observed marginals. |
Luo X (2021). The Upsilon Test for Association Between Categorical Variables. Master's thesis, Department of Computer Science, New Mexico State University, Las Cruces, NM, United States. Publication No. 28649813. ProQuest Dissertations and Theses Global.
library("Upsilon")

# A contingency table with independent row and column variables
x <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    1, 1, 0),
  nrow = 3, byrow = TRUE
)
print(x)
upsilon.test(x)

# A contingency table with a non-dominant function
x <- matrix(
  c(4, 0, 0,
    0, 1, 0,
    0, 0, 1),
  nrow = 3, byrow = TRUE
)
print(x)
upsilon.test(x)

# A contingency table with a dominant function
x <- matrix(
  c(2, 0, 0,
    0, 2, 0,
    0, 0, 2),
  nrow = 3, byrow = TRUE)
print(x)
upsilon.test(x)

# Another contingency table with a dominant function
x <- matrix(
  c(3, 0, 0,
    0, 3, 0,
    0, 0, 0),
  nrow = 3, byrow = TRUE)
print(x)
upsilon.test(x)