---
title: "Getting Started with rSRD"
author:
  - "Balázs R. Sziklai ([0000-0002-0068-8920](https://orcid.org/0000-0002-0068-8920))
    HUN-REN Centre for Economic and Regional Studies, 1097 Tóth Kálmán u. 4, Budapest, Hungary
    Department of Operations Research and Actuarial Sciences, Corvinus University of Budapest, 1093 Fővám tér 8, Budapest, Hungary
    Email: sziklai.balazs@krtk.elte.hu"
  - "Attila Gere ([0000-0003-3075-1561](https://orcid.org/0000-0003-3075-1561))
    Hungarian University of Agriculture and Life Sciences, 1118 Villányi út 29-43, Budapest, Hungary
    Email: Gere.Attila@uni-mate.hu"
  - "Károly Héberger ([0000-0003-0965-939X](https://orcid.org/0000-0003-0965-939X))
    Plasma Chemistry Research Group, Institute of Materials and Environmental Chemistry, HUN-REN Research Centre for Natural Sciences, 1117 Magyar tudósok körútja 2, Budapest, Hungary
    Email: heberger.karoly@ttk.hu"
  - "Jochen Staudacher ([0000-0002-0619-4606](https://orcid.org/0000-0002-0619-4606))
    Fakultät Informatik, Hochschule Kempten, Bahnhofstr. 61, 87435 Kempten, Germany
    Email: jochen.staudacher@hs-kempten.de"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with rSRD}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(rSRD)
```

## What is Sum of Ranking Differences?

When several methods, models, or scoring systems evaluate the same set of objects, a natural question arises: which method agrees most closely with a set of reference values? Sum of Ranking Differences (SRD) answers this question with a single, interpretable number.

The idea is straightforward. Each method produces a ranking of the objects. A reference is chosen --- either an external gold standard or one derived from the data itself (for example, the row-wise mean or median). For each method, the absolute differences between its ranks and the reference ranks are summed. A small SRD value means the method tracks the reference closely, whereas a large SRD value means the method disagrees with it.

`rSRD` normalises SRD values to the interval $[0, 1]$ by dividing by the theoretical maximum SRD for the number of objects being ranked. A value of 0 indicates perfect agreement and a value of 1 indicates the maximum possible disagreement.

To determine whether an observed SRD value is meaningfully small or merely within the range expected by chance, the package uses a Monte Carlo permutation test. It shuffles the assigned ranks 1,000,000 times and recalculates the SRD each time, which yields the distribution expected under the null hypothesis that the method ranks objects randomly.

For a full theoretical treatment of Sum of Ranking Differences, please refer to the original paper by Héberger and Kollár-Hunek (2011) and to the recent extensions on consensus ranking and significance testing in Sziklai, Baranyi and Héberger (2025).

---

## The movies1994 dataset

Throughout this vignette we use the `movies1994` dataset, which is bundled with `rSRD`.
It covers 14 films released in 1994 and records six scores for each, plus a composite reference score:

| Column | Description |
|---|---|
| `RT Reviewers` | Rotten Tomatoes critic score (%) |
| `RT Audience` | Rotten Tomatoes audience score (%) |
| `IMDB` | IMDB user rating (out of 10) |
| `Box Office` | Worldwide box office gross (USD) |
| `Nominations` | Total award nominations received |
| `Awards` | Total awards won |
| `Recognition score` | Composite score: Awards + 0.5 * Nominations. Used as the reference against which all other scoring systems are compared. |

The last column, `Recognition score`, is used as the reference --- it combines nominations and awards into a single measure of industry recognition against which the other scoring systems are compared.

```{r load-data}
path <- system.file("extdata", "movies1994.csv", package = "rSRD")
movies <- read.csv(path, header = TRUE, sep = ";", row.names = 1, check.names = FALSE)
movies
```

---

## Preprocessing

The six scoring systems use very different scales: percentage scores, a 10-point rating, a raw dollar figure, and raw counts. Before running SRD it is good practice to bring all columns onto a common scale so that no single variable dominates purely because of its unit of measurement. Note that each data preprocessing method is strictly monotonic; hence, if the reference is provided and not aggregated from the data, preprocessing has no effect on the SRD analysis.

`utilsPreprocessDF()` offers four methods: `"range_scale"` (maps each column to $[0, 1]$), `"standardize"` (zero mean, unit variance), `"scale_to_unit"` (L2 normalisation), and `"scale_to_max"` (divides by the column maximum). Range scaling is the most natural choice here since all columns measure inherently positive quantities.

```{r preprocess}
movies_scaled <- utilsPreprocessDF(movies, method = "range_scale")
round(head(movies_scaled), 3)
```

Note that the reference column (`Recognition score`) is scaled along with the rest. After scaling, every value lies between 0 and 1.
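As a rough illustration of what range scaling does, and of why a strictly monotonic transformation leaves ranks untouched, here is a minimal base R sketch. It assumes that `"range_scale"` amounts to ordinary min-max scaling per column. The `toy` data frame and the `range_scale()` helper are hypothetical and are not part of `rSRD`.

```r
# Hypothetical toy scores for four objects (not taken from movies1994)
toy <- data.frame(
  critic = c(91, 45, 78, 60),
  gross  = c(2.8e7, 3.5e8, 1.2e8, 6.8e8)
)

# Min-max scaling: maps a numeric vector onto [0, 1]
# (assumed to mirror "range_scale"; not the package's own code)
range_scale <- function(x) (x - min(x)) / (max(x) - min(x))

toy_scaled <- as.data.frame(lapply(toy, range_scale))

# Each column now has minimum 0 and maximum 1
sapply(toy_scaled, range)

# The transformation is strictly increasing, so within-column ranks
# are the same before and after scaling
identical(lapply(toy, rank), lapply(toy_scaled, rank))
```

Because SRD works on ranks, a column that is scaled this way is ranked exactly as before, which is consistent with the remark that preprocessing only matters when the reference is aggregated from the data.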
### Constructing a reference column

If you do not have a natural reference column, `utilsCreateReference()` can derive one from the data itself. Available options are `"max"`, `"min"`, `"mean"`, and `"median"` (applied row-wise), as well as `"mixed"` for a row-by-row specification.

```{r create-reference, eval = FALSE}
# Example: use the row-wise median as the reference
movies_with_ref <- utilsCreateReference(movies_scaled[, -ncol(movies_scaled)],
                                        method = "median")
```

In our case `movies_scaled` already contains `Recognition score` as the last column, so we use it directly as the reference.

---

## Computing SRD values

`calculateSRDValues()` computes the normalised SRD score for every non-reference column. The last column of the data frame is always treated as the reference.

```{r srd-values}
srd <- calculateSRDValues(movies_scaled, output_to_file = FALSE)
srd
```

The result is a named numeric vector with one entry per solution column. Values closer to 0 indicate stronger agreement with the `Recognition score` reference.

`Awards` and `Nominations` agree most closely with the `Recognition score` reference. This is not surprising since the reference is a weighted combination of these two measures (`Recognition = Awards + 0.5 * Nominations`), i.e. their high agreement is a mathematical consequence rather than an independent finding. Among the remaining scoring systems, `RT Reviewers` and `IMDB` show notably lower SRD values than `RT Audience` and `Box Office`, suggesting that critics and dedicated film raters track award recognition more closely than general audiences or commercial success.

---

## Testing significance via permutation tests

While the raw SRD values provide an initial ranking of method performance, their interpretation requires a statistical baseline. Specifically, we must ask whether an observed SRD is significantly lower than what would arise from random ranking.
The function `calculateSRDDistribution()` answers this question by simulating 1,000,000 random permutations of the ranks and computing their SRD distribution. Two threshold values are derived from this distribution:

- **XX1** (5th percentile): SRD values below this threshold agree with the reference *significantly better than chance* at the 5% level.
- **XX19** (95th percentile): SRD values above this threshold rank objects in *reverse order* relative to the reference, also at the 5% level.
- SRD values between XX1 and XX19 are indistinguishable from random rankings.

Because the simulation is stochastic, the thresholds can vary slightly between runs, in particular for small datasets. We recommend always setting a seed in published analyses to ensure reproducibility.

```{r srd-dist, eval = FALSE}
sim <- calculateSRDDistribution(movies_scaled, seed = 42)
cat("XX1 (5% threshold):", sim$xx1, "\n")
cat("Median: ", sim$median, "\n")
cat("XX19 (95% threshold):", sim$xx19, "\n")
```

```{r srd-dist-hidden, echo = FALSE}
# Run with a fixed seed so the vignette output is reproducible
sim <- calculateSRDDistribution(movies_scaled, seed = 42)
cat("XX1 (5% threshold):", sim$xx1, "\n")
cat("Median: ", sim$median, "\n")
cat("XX19 (95% threshold):", sim$xx19, "\n")
```

`plotPermTest()` overlays the observed SRD values on the simulated distribution as vertical coloured lines, with dashed lines marking XX1, the median, and XX19. Solutions whose lines fall to the left of the XX1 dashed line agree with the reference significantly better than chance. Solutions to the right of XX19 indicate a significantly reversed ranking.
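How thresholds such as XX1 and XX19 arise can be illustrated with a minimal base R sketch. It is not the package's C++ implementation: it assumes a tie-free reference ranking `1..n`, uses only 10,000 random rankings instead of 1,000,000, and normalises by $\lfloor n^2 / 2 \rfloor$, the largest possible sum of absolute rank differences for $n$ objects. The helper `srd_norm()` and all variable names are hypothetical.

```r
set.seed(42)                       # fix the RNG so the sketch is repeatable

n       <- 14                      # number of ranked objects, as in movies1994
n_sim   <- 10000                   # far fewer draws than the package uses
ref     <- seq_len(n)              # reference ranking 1..n (no ties assumed)
max_srd <- floor(n^2 / 2)          # maximum attainable sum of |rank differences|

# Normalised SRD between a ranking and the reference ranking
srd_norm <- function(ranks, reference) {
  sum(abs(ranks - reference)) / max_srd
}

# SRD values of n_sim uniformly random rankings (the null distribution)
null_srd <- replicate(n_sim, srd_norm(sample(n), ref))

# Empirical 5%, 50% and 95% points, playing the roles of XX1, median and XX19
quantile(null_srd, probs = c(0.05, 0.5, 0.95))
```

An observed SRD below the 5% point of `null_srd` would be called significantly better than chance in this simplified setting, and one above the 95% point would indicate a reversed ranking.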
```{r perm-plot, eval = FALSE, fig.width = 7, fig.height = 5, fig.alt = "Permutation test plot showing SRD distribution and solution positions"}
plotPermTest(movies_scaled, sim)
```

```{r perm-plot-hidden, echo = FALSE, fig.width = 7, fig.height = 5, fig.alt = "Permutation test plot showing SRD distribution and solution positions"}
plotPermTest(movies_scaled, sim)
```

The permutation test reveals that `RT Reviewers` (critics' score) and `IMDB` ratings agree with the `Recognition score` reference significantly better than chance --- their SRD values fall to the left of the XX1 threshold. `Awards` and `Nominations` are also significant, though this is largely a consequence of the `Recognition score` reference being composed of these two measures, i.e. their significance should be interpreted with caution.

`RT Audience` and `Box Office` fall between XX1 and XX19, meaning their rankings are statistically indistinguishable from a random ordering relative to the `Recognition score` reference. `Box Office` is the farthest from the reference, which suggests that award committees and general audiences evaluate films by different criteria. Commercial blockbusters such as *The Mask* and *Speed* rank highly by box office but received comparatively few nominations, while critically acclaimed films such as *The Shawshank Redemption* --- a box office disappointment on release --- were heavily nominated. This disconnect between commercial success and award recognition is one of the most striking findings the SRD analysis reveals.

Note that with only 14 observations the simulation thresholds are more variable than they would be for a larger dataset. The `seed = 42` argument ensures the thresholds reported here are reproducible.

---

## Cross-validation

The permutation test provides an absolute benchmark by testing whether a given method's SRD is significantly lower than random expectation. Cross-validation extends this framework to a relative comparison.
It tests whether the SRD values of two methods are statistically distinguishable from each other, thereby enabling direct pairwise model selection.

The function `calculateCrossValidation()` divides the data into folds, computes SRD values for each fold and executes a statistical test (Wilcoxon, Alpaydin or Dietterich) between consecutive pairs of solutions sorted by their median fold SRD.

```{r cv-hidden, echo = FALSE}
cv <- calculateCrossValidation(movies_scaled, method = "Wilcoxon",
                               number_of_folds = 7,
                               output_to_file = FALSE, seed = 42)
```

```{r cv, eval = FALSE}
cv <- calculateCrossValidation(movies_scaled, method = "Wilcoxon",
                               number_of_folds = 7,
                               output_to_file = FALSE, seed = 42)
cv$statistical_significance
```

The significance labels are:

| Label | Meaning |
|---|---|
| `(p<0.05*)` | The two adjacent solutions are significantly different at the 5% level |
| `(p<0.1)` | Significant at the 10% level |
| `n.s.` | Not significantly different |
| `n.a.` | Not available (fold count outside the 5--10 range) |

`plotCrossValidation()` displays the fold-wise SRD values as box-and-whisker plots, ordered by median, making it easy to see which solutions are stable across folds and which are sensitive to the subset of films used.

```{r cv-plot, eval = FALSE, fig.width = 7, fig.height = 5, fig.alt = "Box-whisker plot of cross-validation SRD values by solution"}
plotCrossValidation(cv)
```

```{r cv-plot-hidden, echo = FALSE, fig.width = 7, fig.height = 5, fig.alt = "Box-whisker plot of cross-validation SRD values by solution"}
plotCrossValidation(cv)
```

The box-whisker plot orders the scoring systems from left to right by increasing median SRD. `Awards` and `Nominations` occupy the two leftmost positions, confirming their strong agreement with the `Recognition score` reference --- though as noted above, this is partly a mathematical consequence of how the reference is constructed.
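The general idea of comparing fold-wise SRD values can be sketched in base R on simulated data. This is only an illustration under stated assumptions, not a reproduction of `calculateCrossValidation()`: the random fold assignment, the re-ranking inside each fold, and the use of a paired `wilcox.test()` are choices made here for the sketch, and the data and helper names are hypothetical.

```r
set.seed(1)                                   # repeatable toy data

n   <- 14
ref <- rnorm(n)                               # hypothetical reference scores
methods <- data.frame(
  close = ref + rnorm(n, sd = 0.3),           # tracks the reference well
  noisy = ref + rnorm(n, sd = 2.0)            # tracks the reference poorly
)

# Normalised SRD computed from raw scores (ranks are taken inside)
srd_norm <- function(x, reference) {
  m <- length(x)
  sum(abs(rank(x) - rank(reference))) / floor(m^2 / 2)
}

# Assign the 14 objects to 7 folds of 2 objects each (random assignment assumed)
fold_id <- sample(rep(1:7, each = 2))

# For each fold, leave its 2 objects out and recompute SRD on the other 12
fold_srd <- t(sapply(1:7, function(k) {
  keep <- fold_id != k
  sapply(methods, function(col) srd_norm(col[keep], ref[keep]))
}))

fold_srd                                      # 7 x 2 matrix of fold-wise SRD values

# Paired Wilcoxon signed-rank test on the fold-wise values of the two methods
wilcox.test(fold_srd[, "close"], fold_srd[, "noisy"],
            paired = TRUE, exact = FALSE)
```

Each row of `fold_srd` corresponds to one box-plot observation per method, so a method with widely scattered rows would show up as a wide box.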
The pairwise Wilcoxon test finds that `Awards` and `Nominations` are not significantly different from each other (`n.s.`), but `Nominations` and `IMDB` *are* significantly different (`p<0.05`), marking a meaningful boundary between the award-derived measures and the independent scoring systems.

`IMDB` and `RT Reviewers` occupy the middle positions and are not significantly different from each other (`n.s.`), nor is `RT Reviewers` significantly different from `RT Audience` (`n.s.`). This suggests that while critics and dedicated film raters (IMDB) track award recognition somewhat more closely than general audiences, the difference between these three systems is not statistically compelling with this dataset. Note that only consecutive methods are tested. Whether the difference between `IMDB` and `RT Audience` is statistically significant is not reported in this analysis.

The most striking finding is the significant gap between `RT Audience` and `Box Office` (`p<0.05`), confirming that commercial box office performance is not merely different from the `Recognition score` reference by degree but is categorically more distant than all other scoring systems. `Box Office` sits furthest from the reference, reinforcing the conclusion from the permutation test, i.e. award committees and general audiences evaluate films by fundamentally different criteria.

Note that with only 14 films and 7 folds, each fold omits just 2 films. Wide boxes indicate scoring systems whose SRD values are sensitive to which 2 films are excluded --- films like *The Shawshank Redemption* (high IMDB and RT scores but low box office) are likely influential observations that affect fold-wise results considerably.

---

## Heatmap visualisation

`plotHeatmapSRD()` provides a complementary view of the results by displaying the pairwise distances between all scoring systems --- including the reference --- as a colour-coded heatmap.
By default, the heatmap uses a red-to-blue gradient, with red indicating SRD = 0 and blue indicating SRD = 1. Users may supply custom palettes, and the number of colour breaks corresponds to the palette length.

```{r heatmap, eval = FALSE}
plotHeatmapSRD(movies_scaled)
```

```{r heatmap-hidden, echo = FALSE, fig.width = 7, fig.height = 6, fig.alt = "Heatmap of pairwise SRD distances between scoring systems"}
plotHeatmapSRD(movies_scaled)
```

The heatmap confirms the findings of the permutation test and cross-validation. `Awards` and `Nominations` cluster closely together and sit nearest to the `Recognition score` reference, while `Box Office` is most distant from all other systems.

A particularly striking feature is the distance between `RT Audience` and `Box Office`, despite both reflecting the judgement of general audiences rather than critics or award committees. This apparent contradiction dissolves once we distinguish between two fundamentally different modes of audience behaviour. Buying a cinema ticket is an *anticipatory* decision made before seeing a film, shaped by marketing, star power, and word of mouth. Rating a film on Rotten Tomatoes is a *reflective* decision made after viewing, shaped by the experience of the film itself.

The 1994 dataset illustrates this gap rather vividly. *The Mask* and *Speed* were heavily marketed blockbusters that sold large numbers of tickets but received more measured audience ratings on reflection, while *The Shawshank Redemption* --- a commercial disappointment on release --- consistently receives some of the highest audience ratings ever recorded. Award committees, critics, and reflective online audiences converge in their evaluations, whereas box office receipts apparently tell a different story.

---

## A note on reproducibility

`calculateSRDDistribution()` and `calculateCrossValidation()` both rely on random number generation in C++. Without a seed the results vary slightly between runs.
For small datasets this variability can affect the precise values of XX1 and XX19. The following example makes the issue concrete:

```{r repro-demo, eval = FALSE}
# Two unseeded runs -- XX1 may differ slightly
sim_a <- calculateSRDDistribution(movies_scaled)
sim_b <- calculateSRDDistribution(movies_scaled)
cat("Run A -- XX1:", sim_a$xx1, " XX19:", sim_a$xx19, "\n")
cat("Run B -- XX1:", sim_b$xx1, " XX19:", sim_b$xx19, "\n")

# Two seeded runs -- results are identical
sim_1 <- calculateSRDDistribution(movies_scaled, seed = 42)
sim_2 <- calculateSRDDistribution(movies_scaled, seed = 42)
cat("Seed 42, run 1 -- XX1:", sim_1$xx1, " XX19:", sim_1$xx19, "\n")
cat("Seed 42, run 2 -- XX1:", sim_2$xx1, " XX19:", sim_2$xx19, "\n")
```

For any analysis that will be published or shared, we strongly recommend always specifying a seed. The choice of seed value does not affect the statistical validity of the results; it only guarantees their reproducibility.

---

## Summary of the workflow

```{r workflow-summary, eval = FALSE}
# 1. Load data (last column = reference)
path <- system.file("extdata", "movies1994.csv", package = "rSRD")
movies <- read.csv(path, header = TRUE, sep = ";", row.names = 1, check.names = FALSE)

# 2. Preprocess
movies_scaled <- utilsPreprocessDF(movies, method = "range_scale")

# 3. Compute SRD values
srd <- calculateSRDValues(movies_scaled, output_to_file = FALSE)

# 4. Permutation test (set seed for reproducibility)
sim <- calculateSRDDistribution(movies_scaled, seed = 42)
plotPermTest(movies_scaled, sim)

# 5. Cross-validation
cv <- calculateCrossValidation(movies_scaled, method = "Wilcoxon",
                               number_of_folds = 7,
                               output_to_file = FALSE, seed = 42)
plotCrossValidation(cv)
```

---

## References

Héberger K., Kollár-Hunek K. (2011) "Sum of ranking differences for method discrimination and its validation: comparison of ranks with random numbers", *Journal of Chemometrics*, **25**(4), pp. 151--158.

Sziklai B.R., Baranyi M., Héberger K. (2025) "Does cross-validation work in telling rankings apart?", *Central European Journal of Operations Research*, **33**, pp. 1503--1528.