Title: | Visualize and Improve Connectedness of Factors in Tables |
---|---|
Description: | Visualize the connectedness of factors in two-way tables. Perform two-way filtering to improve the degree of connectedness. See Weeks & Williams (1964) <doi:10.1080/00401706.1964.10490188>. |
Authors: | Kevin Wright [aut, cre]
|
Maintainer: | Kevin Wright <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1 |
Built: | 2025-03-05 15:27:23 UTC |
Source: | CRAN |
Multiple factors in a dataframe are said to be connected if a model matrix based on those factors is full rank.
This function provides a formula interface to the lfe::compfactor() function to check for connectedness of the factors.
con_check(data = NULL, formula = NULL, WW = TRUE, dropNA = TRUE)
con_check(data = NULL, formula = NULL, WW = TRUE, dropNA = TRUE)
data |
A dataframe |
formula |
A formula with multiple factor names in the dataframe,
like |
WW |
Pass-through argument to |
dropNA |
If TRUE, observed data that are |
A vector with integers representing the group membership of each observation.
Kevin Wright
None
# In the data_eccleston dataframe, each pair of factors is connected. con_check(data_eccleston, ~ row + trt) con_check(data_eccleston, ~ col + trt) con_check(data_eccleston, ~ row + col) # But all three factors are COMPLETELY disconnected into 16 groups. con_check(data_eccleston, ~ row + col + trt)
# In the data_eccleston dataframe, each pair of factors is connected. con_check(data_eccleston, ~ row + trt) con_check(data_eccleston, ~ col + trt) con_check(data_eccleston, ~ row + col) # But all three factors are COMPLETELY disconnected into 16 groups. con_check(data_eccleston, ~ row + col + trt)
Draws a concurrence plot of 2 factors in a dataframe. For example, in a multi-environment yield trial (testing multiple crop varieties in multple environments) it is interesting to examine the balance of the testing pattern. For each pair of environments, how many genotypes are tested in both environments? The concurrence plot shows the amount of connectedness (number of varieties) of the environments with each other.
By default, missing values in the response are deleted.
Replicated combinations of the two factors are ignored. (This could be changed if someone has a need.)
con_concur( data, formula, dropNA = TRUE, xlab = "", ylab = "", cex.x = 0.7, cex.y = 0.7, ... )
con_concur( data, formula, dropNA = TRUE, xlab = "", ylab = "", cex.x = 0.7, cex.y = 0.7, ... )
data |
A dataframe |
formula |
A formula with multiple factor names in the dataframe,
like |
dropNA |
If TRUE, observed data that are |
xlab |
Label for x axis |
ylab |
Label for y axis |
cex.x |
Scale factor for x axis tick labels. Default 0.7. |
cex.y |
Scale factor for y axis tick labels Default 0.7. |
... |
Other parameters passed to the levelplot() function. |
A lattice graphics object
Kevin Wright
None
require(lattice) bar = transform(lattice::barley, env=factor(paste(site,year))) set.seed(123) bar <- bar[sample(1:nrow(bar), 70, replace=TRUE),] con_concur(bar, yield ~ variety / env, cex.x=0.75, cex.y=.3)
require(lattice) bar = transform(lattice::barley, env=factor(paste(site,year))) set.seed(123) bar <- bar[sample(1:nrow(bar), 70, replace=TRUE),] con_concur(bar, yield ~ variety / env, cex.x=0.75, cex.y=.3)
Traditional filtering (subsetting) of data is typically performed via some criteria based on the columns of the data.
In contrast, this function performs filtering of data based on the joint rows and columns of a matrix-view of two factors.
Conceptually, the idea is to re-shape two or three columns of a dataframe into a matrix, and then delete entire rows (or columns) of the matrix if there are too many missing cells in a row (or column).
The two most useful applications of two-way filtering are to:
Remove a factor level that has few interactions with another factor. This is especially useful in linear models to remove rare factor combinations.
Remove a factor level that has any missing interactions with another factor. This is especially useful with biplots of a matrix to remove rows or columns that have missing values.
A formula syntax is used to specify the two-way filtering criteria.
Some examples may provide the easiest understanding.
dat <- data.frame(state=c("NE","NE", "IA", "NE", "IA"), year=c(1,2,2,3,3), value=11:15)
When the 'value' column is re-shaped into a matrix it looks like:
state/year | 1 | 2 | 3 | NE | 11 | 12 | 14 | IA | | 13 | 15 |
Drop states with too much missing combinations. Keep only states with "at least 3 years per state" con_filter(dat, ~ 3 * year / state) NE 1 11 NE 2 12 NE 3 14
Keep only years with "at least 2 states per year" con_filter(dat, ~ 2 * state / year) NE 2 12 IA 2 13 NE 3 14 IA 3 15
If the constant number in the formula is less than 1.0, this is interpreted as a fraction. Keep only states with "at least 75% of years per state" con_filter(dat, ~ .75 * year / state)
It is possible to include another factor on either side of the slash "/". Suppose the data had another factor for political party called "party". Keep only states with "at least 2 combinations of party:year per state" con_filter(dat, ~ 2 * party:year / state)
If the formula contains a response variable, missing values are dropped first, then the two-way filtering is based on the factor combinations. con_filter(dat, value ~ 2 * state / year)
con_filter(data, formula, verbose = TRUE, returndropped = FALSE)
con_filter(data, formula, verbose = TRUE, returndropped = FALSE)
data |
A dataframe |
formula |
A formula with two factor names in the dataframe
that specifies the criteria for filtering,
like |
verbose |
If TRUE, print some diagnostic information about what data is being deleted. (Similar to the 'tidylog' package). |
returndropped |
If TRUE, return the dropped rows instead of the kept rows. Default is FALSE. |
The original dataframe is returned, minus rows that are filtered out.
Kevin Wright
None.
dat <- data.frame( gen = c("G3", "G4", "G1", "G2", "G3", "G4", "G5", "G1", "G2", "G3", "G4", "G5", "G1", "G2", "G3", "G4", "G5", "G1", "G2", "G3", "G4", "G5"), env = c("E1", "E1", "E1", "E1", "E1", "E1", "E1", "E2", "E2", "E2", "E2", "E2", "E3", "E3", "E3", "E3", "E3", "E4", "E4", "E4", "E4", "E4"), yield = c(65, 50, NA, NA, 65, 50, 60, NA, 71, 76, 80, 82, 90, 93, 95, 102, 97, 98, 102, 105, 130, 135)) # How many observations are there for each combination of gen*env? with( subset(dat, !is.na(yield)) , table(gen,env) ) # Note, if there is no response variable, the two-way filtering is based # only on the presence of the factor combinations. dat1 <- con_filter(dat, ~ 4*env / gen) # If there is a response variable, missing values are dropped first, # then the two-way filtering is based on the factor combinations. dat1 <- con_filter(dat, yield ~ 4*env/gen) dat1 <- con_filter(dat, yield ~ 5*env/ gen) dat1 <- con_filter(dat, yield ~ 6*gen/ env) dat1 <- con_filter(dat, yield ~ .8 *env / gen) dat1 <- con_filter(dat, yield ~ .8* gen / env) dat1 <- con_filter(dat, yield ~ 7 * env / gen)
dat <- data.frame( gen = c("G3", "G4", "G1", "G2", "G3", "G4", "G5", "G1", "G2", "G3", "G4", "G5", "G1", "G2", "G3", "G4", "G5", "G1", "G2", "G3", "G4", "G5"), env = c("E1", "E1", "E1", "E1", "E1", "E1", "E1", "E2", "E2", "E2", "E2", "E2", "E3", "E3", "E3", "E3", "E3", "E4", "E4", "E4", "E4", "E4"), yield = c(65, 50, NA, NA, 65, 50, 60, NA, 71, 76, 80, 82, 90, 93, 95, 102, 97, 98, 102, 105, 130, 135)) # How many observations are there for each combination of gen*env? with( subset(dat, !is.na(yield)) , table(gen,env) ) # Note, if there is no response variable, the two-way filtering is based # only on the presence of the factor combinations. dat1 <- con_filter(dat, ~ 4*env / gen) # If there is a response variable, missing values are dropped first, # then the two-way filtering is based on the factor combinations. dat1 <- con_filter(dat, yield ~ 4*env/gen) dat1 <- con_filter(dat, yield ~ 5*env/ gen) dat1 <- con_filter(dat, yield ~ 6*gen/ env) dat1 <- con_filter(dat, yield ~ .8 *env / gen) dat1 <- con_filter(dat, yield ~ .8* gen / env) dat1 <- con_filter(dat, yield ~ 7 * env / gen)
If there is replication for the treatment combination cells in a two-way table, the replications are averaged together (or counted) before constructing the heatmap.
By default, rows and columns are clustered using the 'incidence' matrix of 0s and 1s.
The function checks to see if the cells in the heatmap form a connected set. If not, the data is divided into connected subsets and the subset group number is shown within each cell.
By default, missing values in the response are deleted.
Factor levels are shown along the left and bottom sides.
The number of cells in each column/row is shown along the top/right sides.
If the 2 factors are disconnected, the group membership ID is shown in each cell.
con_view( data, formula, fun.aggregate = mean, xlab = "", ylab = "", cex.num = 0.75, cex.x = 0.7, cex.y = 0.7, col.regions = RedGrayBlue, cluster = "incidence", dropNA = TRUE, ... )
con_view( data, formula, fun.aggregate = mean, xlab = "", ylab = "", cex.num = 0.75, cex.x = 0.7, cex.y = 0.7, col.regions = RedGrayBlue, cluster = "incidence", dropNA = TRUE, ... )
data |
A dataframe |
formula |
A formula with two (or more) factor names in the dataframe
like |
fun.aggregate |
The function to use for aggregating data in cells. Default is mean. |
xlab |
Label for x axis |
ylab |
Label for y axis |
cex.num |
Disjoint group number. |
cex.x |
Scale factor for x axis tick labels. Default 0.7. |
cex.y |
Scale factor for y axis tick labels Default 0.7. |
col.regions |
Function for color regions. Default RedGrayBlue. |
cluster |
If "incidence", cluster rows and columns by the incidence matrix. If FALSE, no clustering is performed. |
dropNA |
If TRUE, observed data that are |
... |
Other parameters passed to the levelplot() function. |
A lattice graphics object
Kevin Wright
require(lattice) bar = transform(lattice::barley, env=factor(paste(site,year))) set.seed(123) bar <- bar[sample(1:nrow(bar), 70, replace=TRUE),] con_view(bar, yield ~ variety * env, cex.x=1, cex.y=.3, cluster=FALSE) # Create a heatmap of cell counts w2b = colorRampPalette(c('wheat','black')) con_view(bar, yield ~ variety * env, fun.aggregate=length, cex.x=1, cex.y=.3, col.regions=w2b, cluster=FALSE) # Example from paper by Fernando et al. (1983). set.seed(42) data_fernando = transform(data_fernando, y=stats::rnorm(9, mean=100)) con_view(data_fernando, y ~ gen*herd, cluster=FALSE, main = "Fernando unsorted") con_view(data_fernando, y ~ gen*herd, cluster=TRUE, main = "Fernando unsorted") # Example from Searle (1971), Linear Models, p. 325 dat2 = transform(data_searle, y=stats::rnorm(nrow(data_searle)) + 100) con_view(dat2, y ~ f1*f2, cluster=FALSE, main="data_searle unsorted") con_view(dat2, y ~ f1*f2, main="data_searle clustered")
require(lattice) bar = transform(lattice::barley, env=factor(paste(site,year))) set.seed(123) bar <- bar[sample(1:nrow(bar), 70, replace=TRUE),] con_view(bar, yield ~ variety * env, cex.x=1, cex.y=.3, cluster=FALSE) # Create a heatmap of cell counts w2b = colorRampPalette(c('wheat','black')) con_view(bar, yield ~ variety * env, fun.aggregate=length, cex.x=1, cex.y=.3, col.regions=w2b, cluster=FALSE) # Example from paper by Fernando et al. (1983). set.seed(42) data_fernando = transform(data_fernando, y=stats::rnorm(9, mean=100)) con_view(data_fernando, y ~ gen*herd, cluster=FALSE, main = "Fernando unsorted") con_view(data_fernando, y ~ gen*herd, cluster=TRUE, main = "Fernando unsorted") # Example from Searle (1971), Linear Models, p. 325 dat2 = transform(data_searle, y=stats::rnorm(nrow(data_searle)) + 100) con_view(dat2, y ~ f1*f2, cluster=FALSE, main="data_searle unsorted") con_view(dat2, y ~ f1*f2, main="data_searle clustered")
Data from Eccleston & Russell
data_eccleston
data_eccleston
An object of class data.frame
with 16 rows and 3 columns.
A dataframe with 3 treatment factors. Each pair of factors is connected, but the 3 factors are disconnected. The 'trt' column uses numbers to match Eccleston (Table 1, Design 1) and letters to match Foulley (Table 13.3).
Eccleston, J. and K. Russell (1975). Connectedness and orthogonality in multi-factor designs. Biometrika, 62, 341-345. https://doi.org/10.1093/biomet/62.2.341
Foulley, J. L., Bouix, J., Goffinet, B., & Elsen, J. M. (1990). Connectedness in Genetic Evaluation. Advanced Series in Agricultural Sciences, 277–308. https://doi.org/10.1007/978-3-642-74487-7_13
# Each pair of factors is connected con_check(data_eccleston, ~ row + trt) con_check(data_eccleston, ~ col + trt) con_check(data_eccleston, ~ row + col) # But all three factors are COMPLETELY disconnected con_check(data_eccleston, ~ row + col + trt) set.seed(42) data_eccleston <- transform(data_eccleston, y = rnorm(nrow(data_eccleston), mean=100)) con_view(data_eccleston, y ~ row*col, xlab="row", ylab="col") con_view(data_eccleston, y ~ row*trt, xlab="row", ylab="trt") con_view(data_eccleston, y ~ col*trt, xlab="col", ylab="trt")
# Each pair of factors is connected con_check(data_eccleston, ~ row + trt) con_check(data_eccleston, ~ col + trt) con_check(data_eccleston, ~ row + col) # But all three factors are COMPLETELY disconnected con_check(data_eccleston, ~ row + col + trt) set.seed(42) data_eccleston <- transform(data_eccleston, y = rnorm(nrow(data_eccleston), mean=100)) con_view(data_eccleston, y ~ row*col, xlab="row", ylab="col") con_view(data_eccleston, y ~ row*trt, xlab="row", ylab="trt") con_view(data_eccleston, y ~ col*trt, xlab="col", ylab="trt")
Data from Fernando et al.
data_fernando
data_fernando
An object of class data.frame
with 9 rows and 2 columns.
A dataframe with 2 treatment factors. The treatment combinations form 2 disconnected groups.
Fernando et al. (1983). Identifying All Connected Subsets In A Two-Way Classification Without Interaction. J. of Dairy Science, 66, 1399-1402. Table 1. https://doi.org/10.3168/jds.S0022-0302(83)81951-1
library(lfe) cbind(data_fernando, .group=con_check(data_fernando, ~ gen + herd)) library(connected) set.seed(42) data_fernando = transform(data_fernando, y=stats::rnorm(9, mean=100)) con_view(data_fernando, y ~ gen*herd, cluster=FALSE, main = "Fernando unsorted") con_view(data_fernando, y ~ gen*herd, main="Fernando clustered")
library(lfe) cbind(data_fernando, .group=con_check(data_fernando, ~ gen + herd)) library(connected) set.seed(42) data_fernando = transform(data_fernando, y=stats::rnorm(9, mean=100)) con_view(data_fernando, y ~ gen*herd, cluster=FALSE, main = "Fernando unsorted") con_view(data_fernando, y ~ gen*herd, main="Fernando clustered")
Data from Searle
data_searle
data_searle
An object of class data.frame
with 14 rows and 2 columns.
A dataframe with 2 treatment factors. The treatment combinations form 3 disconnected groups.
Searle (1971). Linear Models. Page 324.
cbind(data_searle, .group=con_check(data_searle, ~ f1 + f2)) data_searle = transform(data_searle, y = rnorm(nrow(data_searle), mean=100)) con_view(data_searle, y ~ f1*f2, cluster=FALSE, main="Searle unsorted") con_view(data_searle, y ~ f1*f2, main="Searle clustered")
cbind(data_searle, .group=con_check(data_searle, ~ f1 + f2)) data_searle = transform(data_searle, y = rnorm(nrow(data_searle), mean=100)) con_view(data_searle, y ~ f1*f2, cluster=FALSE, main="Searle unsorted") con_view(data_searle, y ~ f1*f2, main="Searle clustered")
Simulated Data of Student Test Scores
data("data_student")
data("data_student")
A data frame with 41 observations on the following 4 variables.
student
student ID
class
class/subject
test1
score on test 1
test2
score on test 2
This simulated data imagines the following scenario:
In a school, 12 students (A-M) were tested in 6 classes (math, chemistry, physics, art, horticulture, welding).
Not all students were enrolled in all classes.
In each class, students were given test 1, then later test 2. Test scores could be 0-100.
(Using 2 tests is typical for learning assessment, measuring intervention effectiveness, comparison of test forms, etc.)
Some students missed class on the day of a test and had no score for that test.
Student B felt ill during art test 1 and was allowed to re-take test 1 (both test 1 scores are included and the same test 2 score was used).
None
None
data(data_student)
data(data_student)
Data from Tosh
data_tosh
data_tosh
An object of class data.frame
with 15 rows and 3 columns.
A dataframe with 3 treatment factors. The treatment combinations form 2 disconnected groups.
Tosh, J. J., and J. W. Wilton. (1990). Degree of connectedness in mixed models. Proceedings of the 4th World Congress on Genetics applied to Livestock Production, 480-483. Page 481.
cbind(data_tosh, .group=con_check(data_tosh, ~ a + b + c)) data_tosh = transform(data_tosh, y = rnorm(nrow(data_tosh), mean=100)) library(connected) con_view(data_tosh, y ~ b * c)
cbind(data_tosh, .group=con_check(data_tosh, ~ a + b + c)) data_tosh = transform(data_tosh, y = rnorm(nrow(data_tosh), mean=100)) library(connected) con_view(data_tosh, y ~ b * c)
Data from Weeks & Williams example 1
data_weeks1
data_weeks1
An object of class data.frame
with 16 rows and 3 columns.
A dataframe with 3 treatment factors. The treatment combinations are connected.
Note: This data is based on Table 1 of Weeks & Williams. Table 2 is missing treatment combination (1,2,4).
Weeks, David L. & Donald R. Williams (1964). A Note on the Determination of Connectedness in an N-Way Cross Classification. Technometrics, 6:3, 319-324. Table 1. http://dx.doi.org/10.1080/00401706.1964.10490188
library(lfe) cbind(data_weeks1, .group=con_check(data_weeks1, ~ f1+f2+f3))
library(lfe) cbind(data_weeks1, .group=con_check(data_weeks1, ~ f1+f2+f3))
Data from Weeks & Williams example 2
data_weeks2
data_weeks2
An object of class data.frame
with 62 rows and 3 columns.
A dataframe with 3 treatment factors. The treatment combinations form 4 disconnected groups.
Note: This data is based on Table 3 of Weeks & Williams. The groups defined in the text are missing some combinations.
Weeks, David L. & Donald R. Williams (1964). A Note on the Determination of Connectedness in an N-Way Cross Classification. Technometrics, 6:3, 319-324. Table 3. http://dx.doi.org/10.1080/00401706.1964.10490188
library(lfe) cbind(data_weeks2, .group=con_check(data_weeks2, ~f1 + f2 + f3))
library(lfe) cbind(data_weeks2, .group=con_check(data_weeks2, ~f1 + f2 + f3))