| Title: | Sum of Ranking Differences Statistical Test |
|---|---|
| Description: | We provide an implementation for Sum of Ranking Differences (SRD), a novel statistical test introduced by Héberger (2010) <doi:10.1016/j.trac.2009.09.009>. The test compares different solutions against a reference by first performing a rank transformation on the input, then calculating and comparing the distances between the solutions and the reference; these distances are measured in the L1 norm. The reference can be an external benchmark (e.g. an established gold standard) or can be aggregated from the data. The calculated distances, called SRD scores, are validated in two ways; see Héberger and Kollár-Hunek (2011) <doi:10.1002/cem.1320>. A randomization test (also called permutation test) compares the SRD scores of the solutions to the SRD scores of randomly generated rankings. The second validation option is cross-validation, which checks whether the rankings generated from the solutions come from the same distribution or not. For a detailed analysis of the cross-validation process see Sziklai, Baranyi and Héberger (2021) <doi:10.48550/arXiv.2105.11939>. The package offers a wide array of features related to SRD including the computation of the SRD scores, validation options, input preprocessing and plotting tools. |
| Authors: | Jochen Staudacher [aut, cph, cre] (ORCID: <https://orcid.org/0000-0002-0619-4606>), Balázs R. Sziklai [aut, cph] (ORCID: <https://orcid.org/0000-0002-0068-8920>), Linus Olsson [aut, cph], Dennis Horn [ctb], Alexander Pothmann [ctb], Ali Tugay Sen [ctb], Attila Gere [ctb] (ORCID: <https://orcid.org/0000-0003-3075-1561>), Károly Héberger [ctb] (ORCID: <https://orcid.org/0000-0003-0965-939X>) |
| Maintainer: | Jochen Staudacher <[email protected]> |
| License: | GPL-3 |
| Version: | 0.2.0 |
| Built: | 2026-06-30 16:50:58 UTC |
| Source: | https://github.com/cran/rSRD |
R interface to test whether the rankings induced by the columns come from the same distribution. If the number of folds and the test method are not specified, the default is the Wilcoxon test combined with 8-fold cross-validation. If the number of rows is less than 8, leave-one-out cross-validation is applied. Columns are ordered based on the SRD values of the different folds, then each pair of consecutive columns is tested. The test statistic for the Alpaydin test follows an F distribution with df1=2k, df2=k degrees of freedom. The Dietterich test statistic follows a t-distribution with k degrees of freedom (two-tailed). The Wilcoxon test statistic is calculated as the absolute value of the difference between the sum of the positive ranks (W+) and the sum of the negative ranks (W-). The distribution of this test statistic can be derived from the Wilcoxon signed rank distribution. For more information about the cross-validation process see Sziklai, Baranyi and Héberger (2021).
calculateCrossValidation(data_matrix, method = "Wilcoxon", number_of_folds = 8, precision = 5, output_to_file = TRUE, seed = NULL)
data_matrix |
A data frame. All columns must be numeric and free of
|
method |
A string specifying the method. The methods "Wilcoxon", "Alpaydin" and "Dietterich" are available. |
number_of_folds |
The number of folds used in the cross-validation. Ranges from 5 to 10. |
precision |
The precision used for the ranking matrix transformation. |
output_to_file |
Boolean flag to enable file output. |
seed |
An optional integer seed for the random number generator in the
C++ shuffling routines, enabling reproducible results. When |
A List containing
a new column order sorted by the median of the SRD values computed on the different folds
a vector of test statistics, one for each pair of consecutive columns
a vector indicating the test statistics' statistical significance
the SRD values of different folds and
additional data needed for the plotCrossValidation function.
Balázs R. Sziklai [email protected], Linus Olsson [email protected], Jochen Staudacher [email protected]
Sziklai, Balázs R., Máté Baranyi, and Károly Héberger (2021). "Testing Cross-Validation Variants in Ranking Environments", arXiv preprint arXiv:2105.11939.
df <- data.frame(
  Sol_1=c(7, 6, 5, 4, 3, 2, 1),
  Sol_2=c(1, 2, 3, 4, 5, 7, 6),
  Sol_3=c(1, 2, 3, 4, 7, 5, 6),
  Ref=c(1, 2, 3, 4, 5, 6, 7))
calculateCrossValidation(df, output_to_file = FALSE)
calculateCrossValidation(df, output_to_file = FALSE, seed = 42)
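The Wilcoxon statistic described for calculateCrossValidation, the absolute difference between W+ and W-, can be sketched in base R. This is only an illustration, not the package's C++ routine: the helper name `wilcoxon_abs_diff`, the made-up fold values, and the choice to drop zero differences before ranking are all assumptions.

```r
# Hypothetical helper (not part of rSRD): |W+ - W-| for paired fold-wise values.
wilcoxon_abs_diff <- function(x, y) {
  d <- x - y
  d <- d[d != 0]                          # assumption: zero differences are dropped
  r <- rank(abs(d), ties.method = "average")
  w_plus  <- sum(r[d > 0])                # sum of the ranks of positive differences
  w_minus <- sum(r[d < 0])                # sum of the ranks of negative differences
  abs(w_plus - w_minus)
}

# Made-up per-fold values for two neighbouring columns (integers for clarity).
fold_a <- c(12, 15, 9, 20, 14)
fold_b <- c(10, 18, 8, 13, 14)
stat <- wilcoxon_abs_diff(fold_a, fold_b)
print(stat)
```

Here the non-zero differences are 2, -3, 1 and 7, so W+ is 7 and W- is 3.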
R interface to calculate the SRD distribution that corresponds
to the data. The simulation draws 1,000,000 random rankings; for small
n the resulting thresholds (XX1, XX19) can vary slightly between
runs. Use the seed parameter for fully reproducible results.
calculateSRDDistribution(data_matrix, option = "f", tie_probability = 0, output_to_file = FALSE, seed = NULL)
data_matrix |
A data frame. All columns must be numeric and free of
|
option |
A character to specify how ties are generated in the simulation. The following options are available:
|
tie_probability |
The probability with which ties can occur. |
output_to_file |
Boolean flag to enable file output. Default
|
seed |
An optional integer seed for the random number generator in the
C++ simulation, enabling reproducible results. When |
A list containing the SRD distribution and related descriptive
statistics. xx1 indicates the 5 percent significance threshold.
SRD values between xx1 and xx19 are indistinguishable from
random rankings; a value above xx19 indicates reverse ordering
(5 percent significance).
Balázs R. Sziklai [email protected]
Linus Olsson [email protected]
Jochen Staudacher [email protected]
df <- data.frame(
  A=c(32, 52, 44, 44, 47),
  B=c(73, 75, 65, 76, 70),
  C=c(60, 59, 57, 55, 60),
  D=c(35, 24, 44, 83, 47),
  E=c(41, 52, 46, 50, 65))
calculateSRDDistribution(df, option = 'p', tie_probability = 0.5)
# Reproducible run:
calculateSRDDistribution(df, seed = 42)
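To make the idea of the simulated SRD distribution concrete, the following base R sketch draws random tie-free rankings and measures their L1 distance from a fixed reference ranking. It is not the package's C++ simulation: the helper `simulate_srd`, the reduced number of draws, and the use of empirical 5th and 95th percentiles as stand-ins for thresholds of the xx1 and xx19 kind are assumptions made for illustration.

```r
# Hypothetical helper (not part of rSRD): SRD of random tie-free rankings
# against the reference ranking 1..n.
simulate_srd <- function(n, draws = 10000, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)      # seeding makes the run reproducible
  ref <- seq_len(n)
  replicate(draws, sum(abs(sample(n) - ref)))
}

n <- 5
sim <- simulate_srd(n, draws = 10000, seed = 42)

# Normalise by the largest possible distance for n rows (floor(n^2 / 2)).
normalised <- sim / floor(n^2 / 2)

# Empirical 5th percentile, median and 95th percentile of the simulated values.
print(quantile(normalised, c(0.05, 0.5, 0.95)))
```

With only five rows the distribution is coarse, which is one way to see why thresholds estimated from it can move slightly between unseeded runs.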
R interface to calculate SRD values. To test the results' significance run calculateSRDDistribution(). For more information about SRD scores and their validation see Héberger and Kollár-Hunek (2011).
calculateSRDValues(data_matrix, output_to_file = FALSE)
data_matrix |
A data frame. All columns must be numeric and free of
|
output_to_file |
Boolean flag to enable file output. Default
|
A numeric vector containing the normalised SRD values (each in
) for every non-reference column.
Balázs R. Sziklai [email protected]
Linus Olsson [email protected]
Jochen Staudacher [email protected]
Héberger K., Kollár-Hunek K. (2011) "Sum of ranking differences for method discrimination and its validation: comparison of ranks with random numbers", Journal of Chemometrics, 25(4), pp. 151-158.
df <- data.frame(
  A=c(32, 52, 44, 44, 47),
  B=c(73, 75, 65, 76, 70),
  C=c(60, 59, 57, 55, 60),
  D=c(35, 24, 44, 83, 47),
  E=c(41, 52, 46, 50, 65))
calculateSRDValues(df)
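As a rough cross-check of what an SRD value is, the computation can be sketched in base R on the same example data. The sketch assumes that the last column (E) is the reference, that ties receive fractional ranks, and that normalisation divides by floor(n^2 / 2); these are illustrative assumptions, not a statement of how calculateSRDValues is implemented.

```r
srd_input <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))

# Rank transformation, column by column (ties get the average rank).
ranks <- sapply(srd_input, rank, ties.method = "average")

# Assumption: the last column is the reference.
ref <- ranks[, ncol(ranks)]
solutions <- ranks[, -ncol(ranks), drop = FALSE]

# L1 distance of every solution ranking from the reference ranking.
srd <- colSums(abs(solutions - ref))

# Divide by the largest possible distance for this number of rows.
srd_normalised <- srd / floor(nrow(srd_input)^2 / 2)
print(srd)
print(srd_normalised)
```

For this data the raw distances are 3, 8, 7 and 8 for columns A to D, out of a possible maximum of 12.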
Plots data generated by the calculateCrossValidation function as a boxplot. Includes max and min as whiskers as well as the average (marked by a crossed circle), median (marked by a horizontal bold line) and the 1st and 3rd quartile of the values. Box geometry (min, Q1, median, Q3, max) is taken directly from the precomputed 'boxplot_values', the single source of truth also reported to the user elsewhere, rather than being re-derived from the fold-wise values. This avoids a second, independent quantile calculation that could drift from the official figures – and, in particular, guarantees that identical reported statistics (whether within one solution, e.g. Q1 equal to the median, or across different solutions sharing the same value) are always drawn at exactly the same height.
plotCrossValidation(cv_results)
cv_results |
The List of results returned by the calculateCrossValidation function. |
None.
Linus Olsson [email protected]
Alexander Pothmann
Jochen Staudacher [email protected]
df <- data.frame(
  Sol_1=c(7, 6, 5, 4, 3, 2, 1),
  Sol_2=c(1, 2, 3, 4, 5, 7, 6),
  Sol_3=c(1, 2, 3, 4, 7, 5, 6),
  Ref=c(1, 2, 3, 4, 5, 6, 7))
cv_results <- rSRD::calculateCrossValidation(df, output_to_file = FALSE)
rSRD::plotCrossValidation(cv_results)
Generates a heatmap based on the pairwise distance, measured in SRD, of the columns. Each column is set as the reference once, then SRD values are calculated for the other columns.
plotHeatmapSRD(df, output_to_file = FALSE, color = utilsColorPalette)
df |
A DataFrame. |
output_to_file |
Logical. If true, the distance matrix will be saved to the hard drive. |
color |
Vector of colors used for the image. Defaults to colors |
Returns a heatmap and the corresponding distance matrix.
Attila Gere [email protected], Linus Olsson [email protected], Jochen Staudacher [email protected]
srdInput <- data.frame(
  A=c(32, 52, 44, 44, 47),
  B=c(73, 75, 65, 76, 70),
  C=c(60, 59, 57, 55, 60),
  D=c(35, 24, 44, 83, 47),
  E=c(41, 52, 46, 50, 65))
plotHeatmapSRD(srdInput)
mycolors <- c("#e3f2fd", "#bbdefb", "#90caf9", "#64b5f6", "#42a5f5",
              "#2196f3", "#1e88e5", "#1976d2", "#1565c0", "#0d47a1")
plotHeatmapSRD(srdInput, color=mycolors)
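The pairwise distance matrix behind such a heatmap can be approximated in base R: rank every column, then take the L1 (Manhattan) distance between each pair of rank vectors. The sketch below leaves the distances unnormalised and uses fractional ranks for ties; whether plotHeatmapSRD reports normalised values is not stated here, so treat the scale as an assumption.

```r
heat_input <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))

# One rank vector per column (ties get the average rank).
ranks <- sapply(heat_input, rank, ties.method = "average")

# dist() works on rows, so transpose; "manhattan" is the L1 distance.
pairwise_srd <- as.matrix(dist(t(ranks), method = "manhattan"))
print(pairwise_srd)

# Base R can draw such a matrix directly, for example with:
# image(pairwise_srd)
```

Because the L1 distance is symmetric, using column X as the reference for Y gives the same raw value as using Y as the reference for X.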
Plots the permutation test for the given data frame by using the simulation data created by the calculateSRDDistribution() function. Vertical lines mark the positions of the solution columns on the simulated SRD distribution. Dashed lines indicate the 5 percent significance thresholds (XX1, XX19) and the median (Med).
plotPermTest(df, simulationData, densityToDistr = FALSE)
df |
A DataFrame. |
simulationData |
The output of the calculateSRDDistribution() function. |
densityToDistr |
Logical. If |
None.
Linus Olsson [email protected]
df <- data.frame(
  A=c(32, 52, 44, 44, 47),
  B=c(73, 75, 65, 76, 70),
  C=c(60, 59, 57, 55, 60),
  D=c(35, 24, 44, 83, 47),
  E=c(41, 52, 46, 50, 65))
simulationData <- rSRD::calculateSRDDistribution(df)
plotPermTest(df, simulationData)
Calculates the distance between two rankings in the $L_1$ norm and inserts the result after the first of the two.
utilsCalculateDistance(df, nameCol, refCol)
df |
A DataFrame. |
nameCol |
The current Column of the iteration. |
refCol |
The reference Column of the dataFrame. |
Returns a new df that has a Distance Column based on the nameCol.
Ali Tugay Sen, Jochen Staudacher [email protected]
SRDInput <- data.frame(
  A=c(32, 52, 44, 44, 47),
  B=c(73, 75, 65, 76, 70),
  C=c(60, 59, 57, 55, 60),
  D=c(35, 24, 44, 83, 47),
  E=c(41, 52, 46, 50, 65))
nameCol <- "A"
refCol <- "B"
rSRD::utilsCalculateDistance(SRDInput, nameCol, refCol)
Calculates the ranking of a given column.
utilsCalculateRank(df, nameCol)
df |
A DataFrame. |
nameCol |
The name of the column to be ranked. Note that this parameter needs to be specified as there is no default value. |
Returns a new df that has an additional column with
the rankings of the column specified by nameCol.
Jochen Staudacher [email protected]
SRDInput <- data.frame(
  A=c(32, 52, 44, 44, 47),
  B=c(73, 75, 65, 76, 70),
  C=c(60, 59, 57, 55, 60),
  D=c(35, 24, 44, 83, 47),
  E=c(41, 52, 46, 50, 65))
columnName <- "A"
rSRD::utilsCalculateRank(SRDInput, columnName)
Unique color palette for heatmaps.
utilsColorPalette
An object of class character of length 250.
Attila Gere [email protected]
Balázs R. Sziklai [email protected],
Jochen Staudacher [email protected]
barplot(rep(1,250), col = utilsColorPalette)
Adds a new reference column based on the input DataFrame df and the given method. This function iterates over the rows and applies the given method to define the value of the reference. Available options are: max, min, median, mean and mixed. This column is appended to the DataFrame. When "mixed" is specified the function will consider the refVector for creating the reference column.
utilsCreateReference(df, method = "max", refVector = c())
df |
A DataFrame. |
method |
A string value specifying the reference creating method. Available options: max, min, median, mean and mixed. |
refVector |
A vector of strings that specifies a method for each row. Vector size should be equal to the number of rows in the DataFrame df. |
Returns a new DataFrame appended with the reference column created by the method.
Ali Tugay Sen, Linus Olsson [email protected]
SRDInput <- data.frame(
  A=c(32, 52, 44, 44, 47),
  B=c(73, 75, 65, 76, 70),
  C=c(60, 59, 57, 55, 60),
  D=c(35, 24, 44, 83, 47),
  E=c(41, 52, 46, 50, 65))
proc_data <- rSRD::utilsPreprocessDF(SRDInput)
ref <- c("min","max","min","max","mean")
rSRD::utilsCreateReference(proc_data, method = "mixed", ref)
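The "mixed" option, where each row gets its own aggregation method, can be pictured with a short base R sketch. The helper `create_mixed_reference` and the output column name `Ref` are hypothetical and chosen for illustration; the sketch also skips the preprocessing step and works on the raw example values.

```r
# Hypothetical helper (not part of rSRD): row-wise reference with one
# aggregation method per row.
create_mixed_reference <- function(df, row_methods) {
  stopifnot(length(row_methods) == nrow(df),
            all(row_methods %in% c("max", "min", "median", "mean")))
  m <- as.matrix(df)
  ref <- vapply(seq_len(nrow(m)), function(i) {
    f <- match.fun(row_methods[i])        # look up max, min, median or mean
    f(m[i, ])
  }, numeric(1))
  cbind(df, Ref = ref)                    # "Ref" is an assumed column name
}

raw_input <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))
with_ref <- create_mixed_reference(raw_input, c("min", "max", "min", "max", "mean"))
print(with_ref)
```

Mixing methods per row is useful when some rows are "smaller is better" criteria and others are "larger is better".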
Detailed calculation of the SRD values including the computation of the ranking transformation. Unless a column is specified with referenceCol, the last column will always be taken as the reference.
utilsDetailedSRD(df, referenceCol, createRefCol = function() { })
df |
A DataFrame. |
referenceCol |
Optional. A string that contains a column of |
createRefCol |
Optional. Can be max, min, median, mean. Creates a new Column based on the existing |
Returns a new DataFrame that shows the detailed SRD computation (ranking transformation and distance calculation). A newly added row contains the SRD values (displayed without normalization).
Ali Tugay Sen, Jochen Staudacher [email protected]
SRDInput <- data.frame(
  A=c(32, 52, 44, 44, 47),
  B=c(73, 75, 65, 76, 70),
  C=c(60, 59, 57, 55, 60),
  D=c(35, 24, 44, 83, 47),
  E=c(41, 52, 46, 50, 65))
rSRD::utilsDetailedSRD(SRDInput)
Detailed calculation of the SRD values including the computation of the ranking transformation.
Unless a column is specified with referenceCol, the last column will always be taken as the reference.
This variant differs from utilsDetailedSRD in that non-numeric columns will not be converted to chars,
i.e. the data types of non-numeric columns will be preserved in the output.
utilsDetailedSRDNoChars(df, referenceCol, createRefCol = function() { })
df |
A DataFrame. |
referenceCol |
Optional. A string that contains a column of |
createRefCol |
Optional. Can be max, min, median, mean. Creates a new Column based on the existing |
Returns a new DataFrame that shows the detailed SRD computation (ranking transformation and distance calculation). A newly added row contains the SRD values (displayed without normalization).
Ali Tugay Sen, Jochen Staudacher [email protected]
SRDInput <- data.frame(
  A=c(32, 52, 44, 44, 47),
  B=c(73, 75, 65, 76, 70),
  C=c(60, 59, 57, 55, 60),
  D=c(35, 24, 44, 83, 47),
  E=c(41, 52, 46, 50, 65))
rSRD::utilsDetailedSRDNoChars(SRDInput)
Calculates the maximum distance between two rankings of size n. This function is used to normalize SRD values.
utilsMaxSRD(rowsCount)
rowsCount |
The number of rows in the SRD calculation. |
The maximum achievable SRD value.
Dennis Horn
maxSRD <- rSRD::utilsMaxSRD(5)
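The maximum distance can be reasoned about directly: for tie-free rankings of n items, the L1 distance is largest when one ranking is the exact reverse of the other, which gives floor(n^2 / 2). The sketch below checks that closed form against a reversed ranking in base R; the helper names are hypothetical, and the closed form is stated as a general property of rankings rather than as the package's code.

```r
# Hypothetical helpers (not part of rSRD).
max_srd_closed_form <- function(n) floor(n^2 / 2)

# L1 distance between the ranking 1..n and its reverse n..1.
reversed_srd <- function(n) sum(abs(seq_len(n) - rev(seq_len(n))))

sizes <- 1:10
comparison <- data.frame(
  n = sizes,
  closed_form = sapply(sizes, max_srd_closed_form),
  reversed = sapply(sizes, reversed_srd))
print(comparison)
```

For five rows both columns give 12, which is the divisor that would turn a raw SRD of 3 into a normalised value of 0.25.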
This function preprocesses the DataFrame depending on the method.
utilsPreprocessDF(df, method = "range_scale")
df |
A DataFrame. |
method |
A string that should contain "scale_to_unit", "standardize", "range_scale" or "scale_to_max". |
Returns a new df that has been preprocessed based on the method chosen.
Ali Tugay Sen, Dennis Horn [email protected], Linus Olsson [email protected]
SRDInput <- data.frame(
  A=c(32, 52, 44, 44, 47),
  B=c(73, 75, 65, 76, 70),
  C=c(60, 59, 57, 55, 60),
  D=c(35, 24, 44, 83, 47),
  E=c(41, 52, 46, 50, 65))
method <- "standardize"
utilsPreprocessDF(SRDInput, method)
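This manual names four preprocessing methods without defining them. The sketch below shows the textbook transformations usually associated with those names, applied to a single numeric vector. The formulas are assumptions, not confirmed definitions from the package, and whether utilsPreprocessDF applies them per row or per column is not specified here.

```r
# Assumed textbook definitions (not taken from the rSRD source).
scalers <- list(
  standardize   = function(x) (x - mean(x)) / sd(x),         # mean 0, sd 1
  range_scale   = function(x) (x - min(x)) / (max(x) - min(x)),  # into [0, 1]
  scale_to_max  = function(x) x / max(x),                    # largest value becomes 1
  scale_to_unit = function(x) x / sqrt(sum(x^2))             # Euclidean length 1
)

x <- c(32, 52, 44, 44, 47)
scaled <- lapply(scalers, function(f) f(x))
print(scaled)
```

Transformations of this kind make rows or columns measured on different scales comparable before a reference is aggregated from them.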
R interface to perform the rank transformation on the columns of the input data frame. Ties are resolved by fractional ranking.
utilsRankingMatrix(data_matrix)
data_matrix |
A data frame. All columns must be numeric and free of
|
A data frame containing the ranking matrix.
Balázs R. Sziklai [email protected], Linus Olsson [email protected]
df <- data.frame(
  A=c(32, 52, 44, 44, 47),
  B=c(73, 75, 65, 76, 70),
  C=c(60, 59, 57, 55, 60),
  D=c(35, 24, 44, 83, 47),
  E=c(41, 52, 46, 50, 65))
utilsRankingMatrix(df)
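Fractional ranking means that tied values share the average of the ranks they would otherwise occupy. Base R's `rank()` does this with `ties.method = "average"`, so the effect can be seen on column A of the example without the package; the variable names below are illustrative only.

```r
a <- c(32, 52, 44, 44, 47)

# Fractional ranking: the two 44s would occupy ranks 2 and 3, so both get 2.5.
fractional <- rank(a, ties.method = "average")

# For contrast, "min" ranking gives both tied values the lower rank.
minimum <- rank(a, ties.method = "min")

print(rbind(value = a, fractional = fractional, minimum = minimum))
```

One practical consequence of fractional ranks is that the ranks of a column always sum to n(n+1)/2, with or without ties.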
Calculates the tie probability for a given vector. The tie probability is defined as the number of consecutive tied component-pairs in the sorted vector divided by the size of the vector minus 1.
utilsTieProbability(x)
x |
A vector. |
Returns the tie probability as a numeric value.
Ali Tugay Sen, Linus Olsson [email protected]
x <- c(1,2,4,4,5,5,6)
rSRD::utilsTieProbability(x)
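The definition above translates almost word for word into base R: sort the vector, count the neighbouring pairs that are equal, and divide by the number of neighbouring pairs. The helper name `tie_probability` is hypothetical, and the sketch follows only the stated definition, not the package source.

```r
# Hypothetical helper (not part of rSRD), following the stated definition.
tie_probability <- function(x) {
  s <- sort(x)
  n <- length(s)
  tied_pairs <- sum(s[-1] == s[-n])       # neighbouring equal pairs after sorting
  tied_pairs / (n - 1)
}

values <- c(1, 2, 4, 4, 5, 5, 6)
p <- tie_probability(values)
print(p)
```

In this vector two of the six neighbouring pairs are tied (4, 4 and 5, 5), so the result is one third.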