Title: | Various Coefficients of Interrater Reliability and Agreement |
---|---|
Description: | Coefficients of Interrater Reliability and Agreement for quantitative, ordinal and nominal data: ICC, Finn-Coefficient, Robinson's A, Kendall's W, Cohen's Kappa, ... |
Authors: | Matthias Gamer <[email protected]>, Jim Lemon <[email protected]>, Ian Fellows <[email protected]>, Puspendra Singh <[email protected]> |
Maintainer: | Matthias Gamer <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.84.1 |
Built: | 2024-12-06 06:28:36 UTC |
Source: | CRAN |
Computes simple and extended percentage agreement among raters.
agree(ratings, tolerance=0)
ratings |
n*m matrix or dataframe, n subjects m raters. |
tolerance |
number of successive rating categories that should be regarded as rater agreement (see details). |
Missing data are omitted in a listwise way.
Using extended percentage agreement (tolerance!=0) is only possible for numerical values. If tolerance equals 1, for example, raters differing by one scale degree are interpreted as agreeing.
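The tolerance rule described above can be sketched in base R. This is only an illustration: `pct_agree` is a hypothetical helper, not a function of the package, and it assumes that a subject counts as agreement when the range of its ratings does not exceed the tolerance.

```r
# Sketch of simple and extended percentage agreement (base R only).
# pct_agree is a hypothetical helper, not part of the package.
pct_agree <- function(ratings, tolerance = 0) {
  ratings <- as.matrix(na.omit(ratings))   # listwise deletion of missing data
  # Assumption: raters agree on a subject when max - min <= tolerance
  spread <- apply(ratings, 1, function(r) max(r) - min(r))
  100 * mean(spread <= tolerance)
}

ratings <- cbind(rater1 = c(1, 2, 3, 4, 5),
                 rater2 = c(1, 3, 3, 2, 5),
                 rater3 = c(1, 2, 4, 4, 5))
pct_agree(ratings)      # subjects 1 and 5 agree exactly: 40
pct_agree(ratings, 1)   # subjects 2 and 3 now count as agreeing: 80
```

With tolerance 1, only the fourth subject (ratings 4, 2, 4) still counts as a disagreement.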
A list with class '"irrlist"' containing the following components:
$method |
a character string describing the method applied for the computation of interrater reliability. |
$subjects |
the number of subjects examined. |
$raters |
the number of raters. |
$irr.name |
a character string specifying the name of the coefficient. |
$value |
coefficient of interrater reliability. |
Matthias Gamer
kappa2, kappam.fleiss, kappam.light
data(video)
agree(video)     # Simple percentage agreement
agree(video, 1)  # Extended percentage agreement
The data frame contains the anxiety ratings of 20 subjects, rated by 3 raters. Values range from 1 (not anxious at all) to 6 (extremely anxious).
data(anxiety)
A data frame with 20 observations on the following 3 variables.
ratings of the first rater
ratings of the second rater
ratings of the third rater
artificial data
data(anxiety)
apply(anxiety,2,table)
Calculates the Bhapkar coefficient of concordance for two raters.
bhapkar(ratings)
ratings |
n*2 matrix or dataframe, n subjects 2 raters. |
Missing data are omitted in a listwise way. The Bhapkar (1966) test is a more powerful alternative to the Stuart-Maxwell test. Both tests are asymptotically equivalent and will produce comparable chi-squared values when applied to a large sample of rated objects.
A list with class "irrlist" containing the following components:
$method |
a character string describing the method. |
$subjects |
the number of data objects. |
$raters |
the number of raters. |
$irr.name |
the name of the coefficient (Chisq). |
$value |
the value of the coefficient. |
$stat.name |
the name and df of the test statistic. |
$statistic |
the value of the test statistic. |
$p.value |
the probability of the test statistic. |
Matthias Gamer
Bhapkar, V.P. (1966). A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association, 61, 228-235.
mcnemar.test, stuart.maxwell.mh, rater.bias
data(vision)
bhapkar(vision)  # Original example used from Bhapkar (1966)
Psychiatric diagnoses of n=30 patients provided by different sets of m=6 raters. Data were used by Fleiss (1971) to illustrate the computation of Kappa for m raters.
data(diagnoses)
A data frame with 30 observations (psychiatric diagnoses with levels 1. Depression, 2. Personality Disorder, 3. Schizophrenia, 4. Neurosis, 5. Other) on 6 variables representing different raters.
a factor including the diagnoses of rater 1 (levels see above)
a factor including the diagnoses of rater 2 (levels see above)
a factor including the diagnoses of rater 3 (levels see above)
a factor including the diagnoses of rater 4 (levels see above)
a factor including the diagnoses of rater 5 (levels see above)
a factor including the diagnoses of rater 6 (levels see above)
Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
data(diagnoses)
table(diagnoses[,1])
Computes the Finn coefficient as an index of the interrater reliability of quantitative data. Additionally, F-test and confidence interval are computed.
finn(ratings, s.levels, model = c("oneway", "twoway"))
ratings |
n*m matrix or dataframe, n subjects m raters. |
s.levels |
the number of different rating categories. |
model |
a character string specifying if a '"oneway"' model (default) with row effects random, or a '"twoway"' model with column and row effects random should be applied. You can specify just the initial letter. |
Missing data are omitted in a listwise way.
The Finn coefficient is especially useful when variance between raters is low (i.e. agreement is high).
For the computation, one can specify whether only the subjects are considered as random effects ('"oneway"' model) or whether subjects and raters are randomly chosen from a bigger pool of persons ('"twoway"' model).
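As a rough illustration of the idea behind the coefficient, the sketch below follows Finn's (1970) definition for the one-way case: one minus the ratio of the observed within-subject variance to the variance expected if ratings were spread uniformly over the s.levels categories, which is (s.levels^2 - 1)/12. The helper `finn_oneway` is hypothetical, and the package's own computation (including the F-test) may differ in detail.

```r
# Sketch of a one-way Finn coefficient (base R only).
# finn_oneway is a hypothetical helper, not part of the package.
finn_oneway <- function(ratings, s.levels) {
  x <- as.matrix(na.omit(ratings))          # listwise deletion
  ms_within   <- mean(apply(x, 1, var))     # within-subject mean square
  ms_expected <- (s.levels^2 - 1) / 12      # variance of a uniform rating on 1..s.levels
  1 - ms_within / ms_expected
}

x <- cbind(rater1 = c(1, 3, 5), rater2 = c(2, 3, 4))
finn_oneway(x, s.levels = 6)   # 1 - (1/3) / (35/12) = 31/35
```

Because the reference variance depends only on the number of categories, the coefficient stays informative even when subjects barely differ from one another.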
A list with class '"irrlist"' containing the following components:
$method |
a character string describing the method applied for the computation of interrater reliability. |
$subjects |
the number of subjects examined. |
$raters |
the number of raters. |
$irr.name |
a character string specifying the name of the coefficient. |
$value |
coefficient of interrater reliability. |
$stat.name |
a character string specifying the name and the df of the corresponding F-statistic. |
$statistic |
the value of the test statistic. |
$p.value |
the p-value for the test. |
Matthias Gamer
Finn, R.H. (1970). A note on estimating the reliability of categorical data. Educational and Psychological Measurement, 30, 71-76.
data(video)
finn(video, 6, model="twoway")
Computes single score or average score ICCs as an index of interrater reliability of quantitative data. Additionally, F-test and confidence interval are computed.
icc(ratings, model = c("oneway", "twoway"), type = c("consistency", "agreement"), unit = c("single", "average"), r0 = 0, conf.level = 0.95)
ratings |
n*m matrix or dataframe, n subjects m raters. |
model |
a character string specifying if a '"oneway"' model (default) with row effects random, or a '"twoway"' model with column and row effects random should be applied. You can specify just the initial letter. |
type |
a character string specifying if '"consistency"' (default) or '"agreement"' between raters should be estimated. If a '"oneway"' model is used, only '"consistency"' can be computed. You can specify just the initial letter. |
unit |
a character string specifying the unit of analysis: Must be one of '"single"' (default) or '"average"'. You can specify just the initial letter. |
r0 |
specification of the null hypothesis r = r0. Note that a one sided test (H1: r > r0) is performed. |
conf.level |
confidence level of the interval. |
Missing data are omitted in a listwise way.
When considering which form of ICC is appropriate for an actual set of data, one has to make several decisions (Shrout & Fleiss, 1979):
1. Should only the subjects be considered as random effects ('"oneway"' model), or are subjects and raters randomly chosen from a bigger pool of persons ('"twoway"' model)?
2. If differences in judges' mean ratings are of interest, interrater '"agreement"' instead of '"consistency"' should be computed.
3. If the unit of analysis is a mean of several ratings, unit should be changed to '"average"'. In most cases, however, single values (unit='"single"') are regarded.
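To make the first two decisions concrete, the following sketch computes the three single-score ICCs from ANOVA mean squares, using the standard formulas given by Shrout & Fleiss (1979) and McGraw & Wong (1996). The helper `icc_single` is hypothetical and omits the F-test and confidence interval.

```r
# Sketch of single-score ICCs from ANOVA mean squares (base R only).
# icc_single is a hypothetical helper, not part of the package.
icc_single <- function(ratings) {
  x <- as.matrix(na.omit(ratings))
  n <- nrow(x); k <- ncol(x)
  grand <- mean(x)
  SSr <- k * sum((rowMeans(x) - grand)^2)          # between subjects
  SSc <- n * sum((colMeans(x) - grand)^2)          # between raters
  SSt <- sum((x - grand)^2)                        # total
  MSr <- SSr / (n - 1)
  MSc <- SSc / (k - 1)
  MSw <- (SSt - SSr) / (n * (k - 1))               # within subjects (one-way)
  MSe <- (SSt - SSr - SSc) / ((n - 1) * (k - 1))   # residual (two-way)
  c(oneway      = (MSr - MSw) / (MSr + (k - 1) * MSw),
    consistency = (MSr - MSe) / (MSr + (k - 1) * MSe),
    agreement   = (MSr - MSe) / (MSr + (k - 1) * MSe + k / n * (MSc - MSe)))
}

# Raters rank the subjects identically but differ by constant offsets
r1 <- c(2, 4, 6, 8, 10)
res <- icc_single(cbind(r1, r1 + 5, r1 + 10))
res   # consistency is 1, agreement is 2/7, one-way is 0.0625
```

The constant offsets between raters leave consistency untouched but count against agreement, which is the distinction drawn in the second decision.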
A list with class '"icclist"' containing the following components:
$subjects |
the number of subjects examined. |
$raters |
the number of raters. |
$model |
a character string describing the selected model for the analysis. |
$type |
a character string describing the selected type of interrater reliability. |
$unit |
a character string describing the unit of analysis. |
$icc.name |
a character string specifying the name of ICC according to McGraw & Wong (1996). |
$value |
the intraclass correlation coefficient. |
$r0 |
the specified null hypothesis. |
$Fvalue |
the value of the F-statistic. |
$df1 |
the numerator degrees of freedom. |
$df2 |
the denominator degrees of freedom. |
$p.value |
the p-value for the one-sided test of H0: r = r0 against H1: r > r0. |
$conf.level |
the confidence level for the interval. |
$lbound |
the lower bound of the confidence interval. |
$ubound |
the upper bound of the confidence interval. |
Matthias Gamer
Bartko, J.J. (1966). The intraclass correlation coefficient as a measure of reliability. Psychological Reports, 19, 3-11.
McGraw, K.O., & Wong, S.P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30-46.
Shrout, P.E., & Fleiss, J.L. (1979). Intraclass correlation: uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
data(anxiety)
icc(anxiety, model="twoway", type="agreement")

r1 <- round(rnorm(20, 10, 4))
r2 <- round(r1 + 10 + rnorm(20, 0, 2))
r3 <- round(r1 + 20 + rnorm(20, 0, 2))
icc(cbind(r1, r2, r3), "twoway")               # High consistency
icc(cbind(r1, r2, r3), "twoway", "agreement")  # Low agreement
Computes iota as an index of interrater agreement of quantitative or nominal multivariate observations.
iota(ratings, scaledata = c("quantitative","nominal"), standardize = FALSE)
ratings |
list of n*m matrices or dataframes with one list element for each variable, n subjects m raters. |
scaledata |
a character string specifying if the data is '"quantitative"' (default) or '"nominal"'. If the data is organized in factors, '"nominal"' is chosen automatically. You can specify just the initial letter. |
standardize |
a logical indicating whether quantitative data should be z-standardized within each variable before the computation of iota. |
Each list element must contain observations for each rater and subject without missing values.
In case of one categorical variable (only one list element), iota reduces to the Fleiss exact kappa coefficient, which was proposed by Conger (1980).
A list with class '"irrlist"' containing the following components:
$method |
a character string describing the method applied for the computation of interrater reliability. |
$subjects |
the number of subjects examined. |
$raters |
the number of raters. |
$irr.name |
a character string specifying the name of the coefficient. |
$value |
value of iota. |
$detail |
a character string specifying if the values were z-standardized before the computation of iota. |
Matthias Gamer
Conger, A.J. (1980). Integration and generalisation of Kappas for multiple raters. Psychological Bulletin, 88, 322-328.
Janson, H., & Olsson, U. (2001). A measure of agreement for interval or nominal multivariate observations. Educational and Psychological Measurement, 61, 277-289.
data(diagnoses)
iota(list(diagnoses))  # produces the same result as...
kappam.fleiss(diagnoses, exact=TRUE)

# Example from Janson & Olsson (2001), Table 1
photo <- list()
photo[[1]] <- cbind(c( 71, 73, 86, 59, 71),  # weight ratings
                    c( 74, 80,101, 62, 83),
                    c( 76, 80, 93, 66, 77))
photo[[2]] <- cbind(c(166,160,187,161,172),  # height rating
                    c(171,170,174,163,182),
                    c(171,165,185,162,181))
iota(photo)
iota(photo, standardize=TRUE)  # iota over standardized values
Calculates Cohen's Kappa and weighted Kappa as an index of interrater agreement between 2 raters on categorical (or ordinal) data. Custom weights for the various degrees of disagreement can be specified.
kappa2(ratings, weight = c("unweighted", "equal", "squared"), sort.levels = FALSE)
ratings |
n*2 matrix or dataframe, n subjects 2 raters. |
weight |
either a character string specifying one predefined set of weights or a numeric vector with own weights (see details). |
sort.levels |
boolean value describing whether factor levels should be (re-)sorted during the calculation. |
Missing data are omitted in a listwise way.
During computation, ratings are converted to factors. Therefore, the categories are ordered accordingly. When ratings are numeric, a sorting of factor levels occurs automatically. Otherwise, levels are sorted when the function is called with sort.levels=TRUE.
kappa2 allows for calculating weighted Kappa coefficients. Besides '"unweighted"' (default), predefined sets of weights are '"equal"' (all levels of disagreement between raters are weighted equally) and '"squared"' (disagreements are weighted according to their squared distance from perfect agreement). The weighted Kappa coefficient with '"squared"' weights equals the product moment correlation under certain conditions.
Custom weights can be specified by supplying the function with a numeric vector of weights, starting from perfect agreement to worst disagreement. The length of this vector must equal the number of rating categories.
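The effect of the weighting can be illustrated with a small base R sketch of Cohen's (1960, 1968) formulas, written as one minus the ratio of observed to chance-expected weighted disagreement. The helper `cohen_kappa` is hypothetical, handles only the unweighted and squared cases, and omits the test statistic.

```r
# Sketch of unweighted and squared-weight Cohen's Kappa (base R only).
# cohen_kappa is a hypothetical helper, not part of the package.
cohen_kappa <- function(r1, r2, weight = c("unweighted", "squared")) {
  weight <- match.arg(weight)
  lev  <- sort(unique(c(r1, r2)))                    # common, sorted categories
  tab  <- table(factor(r1, levels = lev), factor(r2, levels = lev))
  obs  <- tab / sum(tab)                             # observed proportions
  expd <- outer(rowSums(obs), colSums(obs))          # proportions expected by chance
  d    <- abs(outer(seq_along(lev), seq_along(lev), "-"))  # category distance
  w    <- if (weight == "squared") d^2 else (d > 0) * 1    # disagreement weights
  1 - sum(w * obs) / sum(w * expd)
}

r1 <- c(1, 2, 3, 1, 2, 3)
r2 <- c(1, 2, 3, 2, 3, 3)
cohen_kappa(r1, r2)             # 0.5
cohen_kappa(r1, r2, "squared")  # 0.75
```

Both disagreements in this example are only one category apart, so squared weights penalise them lightly relative to chance and the weighted coefficient is higher.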
A list with class '"irrlist"' containing the following components:
$method |
a character string describing the method and the weights applied for the computation of weighted Kappa. |
$subjects |
the number of subjects examined. |
$raters |
the number of raters (=2). |
$irr.name |
a character string specifying the name of the coefficient. |
$value |
value of Kappa. |
$stat.name |
a character string specifying the name of the corresponding test statistic. |
$statistic |
the value of the test statistic. |
$p.value |
the p-value for the test. |
Matthias Gamer
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.
Fleiss, J.L., Cohen, J., & Everitt, B.S. (1969). Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72, 323-327.
data(anxiety)
kappa2(anxiety[,1:2], "squared")  # predefined set of squared weights
kappa2(anxiety[,1:2], (0:5)^2)    # same result with own set of squared weights

# own weights increasing gradually with larger distance from perfect agreement
kappa2(anxiety[,1:2], c(0,1,2,4,7,11))

data(diagnoses)
# Unweighted Kappa for categorical data without a logical order
kappa2(diagnoses[,2:3])
Computes Fleiss' Kappa as an index of interrater agreement between m raters on categorical data. Additionally, category-wise Kappas can be computed.
kappam.fleiss(ratings, exact = FALSE, detail = FALSE)
ratings |
n*m matrix or dataframe, n subjects m raters. |
exact |
a logical indicating whether the exact Kappa (Conger, 1980) or the Kappa described by Fleiss (1971) should be computed. |
detail |
a logical indicating whether category-wise Kappas should be computed |
Missing data are omitted in a listwise way.
The coefficient described by Fleiss (1971) does not reduce to Cohen's Kappa (unweighted) for m=2 raters. Therefore, the exact Kappa coefficient, which is slightly higher in most cases, was proposed by Conger (1980).
The null hypothesis Kappa=0 can only be tested using Fleiss' formulation of Kappa.
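For orientation, the coefficient in Fleiss' (1971) formulation can be sketched in base R as follows. The helper `fleiss_kappa` is hypothetical; it covers neither the exact variant of Conger (1980) nor the test statistic.

```r
# Sketch of Fleiss' (1971) Kappa for m raters (base R only).
# fleiss_kappa is a hypothetical helper, not part of the package.
fleiss_kappa <- function(ratings) {
  ratings <- as.matrix(na.omit(ratings))
  n <- nrow(ratings); m <- ncol(ratings)
  lev <- sort(unique(as.vector(ratings)))
  # counts[i, j]: number of raters who assigned subject i to category j
  counts <- t(apply(ratings, 1, function(r) table(factor(r, levels = lev))))
  P_i   <- (rowSums(counts^2) - m) / (m * (m - 1))   # agreement per subject
  p_j   <- colSums(counts) / (n * m)                 # overall category proportions
  P_bar <- mean(P_i)                                 # mean observed agreement
  P_e   <- sum(p_j^2)                                # agreement expected by chance
  (P_bar - P_e) / (1 - P_e)
}

ratings <- rbind(c("A", "A", "A"),
                 c("A", "A", "B"),
                 c("B", "B", "B"))
fleiss_kappa(ratings)   # (7/9 - 41/81) / (1 - 41/81) = 0.55
```

Chance agreement here is based on category proportions pooled over all raters, which is why the coefficient does not reduce to Cohen's Kappa for two raters.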
A list with class '"irrlist"' containing the following components:
$method |
a character string describing the method applied for the computation of interrater reliability. |
$subjects |
the number of subjects examined. |
$raters |
the number of raters. |
$irr.name |
a character string specifying the name of the coefficient. |
$value |
value of Kappa. |
$stat.name |
a character string specifying the name of the corresponding test statistic. |
$statistic |
the value of the test statistic. |
$p.value |
the p-value for the test. |
$detail |
a table with category-wise kappas and the corresponding test statistics. |
Matthias Gamer
Conger, A.J. (1980). Integration and generalisation of Kappas for multiple raters. Psychological Bulletin, 88, 322-328.
Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
Fleiss, J.L., Levin, B., & Paik, M.C. (2003). Statistical Methods for Rates and Proportions, 3rd Edition. New York: John Wiley & Sons.
data(diagnoses)
kappam.fleiss(diagnoses)               # Fleiss' Kappa
kappam.fleiss(diagnoses, exact=TRUE)   # Exact Kappa
kappam.fleiss(diagnoses, detail=TRUE)  # Fleiss' and category-wise Kappa
kappam.fleiss(diagnoses[,1:4])         # Fleiss' Kappa of raters 1 to 4
Computes Light's Kappa as an index of interrater agreement between m raters on categorical data.
kappam.light(ratings)
ratings |
n*m matrix or dataframe, n subjects m raters. |
Missing data are omitted in a listwise way.
Light's Kappa equals the average of all possible combinations of bivariate Kappas between raters.
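The averaging described above can be sketched in base R by computing an unweighted Cohen's Kappa for every pair of raters and taking the mean. The helper `light_kappa` is hypothetical and leaves out the test statistic.

```r
# Sketch of Light's Kappa as the mean of pairwise Cohen's Kappas (base R only).
# light_kappa is a hypothetical helper, not part of the package.
light_kappa <- function(ratings) {
  ratings <- as.matrix(na.omit(ratings))
  kappa_pair <- function(a, b) {                  # unweighted Cohen's Kappa
    lev <- sort(unique(c(a, b)))
    obs <- table(factor(a, levels = lev), factor(b, levels = lev)) / length(a)
    p_o <- sum(diag(obs))                         # observed agreement
    p_e <- sum(rowSums(obs) * colSums(obs))       # agreement expected by chance
    (p_o - p_e) / (1 - p_e)
  }
  pairs <- combn(ncol(ratings), 2)                # all pairs of raters
  mean(apply(pairs, 2, function(ix) kappa_pair(ratings[, ix[1]], ratings[, ix[2]])))
}

# Raters 1 and 3 agree perfectly; each has Kappa 0.4 with rater 2
r1 <- rep(c(1, 1, 2, 2), times = c(20, 5, 10, 15))
r2 <- rep(c(1, 2, 1, 2), times = c(20, 5, 10, 15))
light_kappa(cbind(r1, r2, r3 = r1))   # (0.4 + 1 + 0.4) / 3 = 0.6
```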
A list with class '"irrlist"' containing the following components:
$method |
a character string describing the method applied for the computation of interrater reliability. |
$subjects |
the number of subjects examined. |
$raters |
the number of raters. |
$irr.name |
a character string specifying the name of the coefficient. |
$value |
value of Kappa. |
$stat.name |
a character string specifying the name of the corresponding test statistic. |
$statistic |
the value of the test statistic. |
$p.value |
the p-value for the test. |
Matthias Gamer
Conger, A.J. (1980). Integration and generalisation of Kappas for multiple raters. Psychological Bulletin, 88, 322-328.
Light, R.J. (1971). Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 76, 365-377.
data(diagnoses)
kappam.light(diagnoses)  # Light's Kappa
Computes Kendall's coefficient of concordance as an index of interrater reliability of ordinal data. The coefficient can be corrected for ties within raters.
kendall(ratings, correct = FALSE)
ratings |
n*m matrix or dataframe, n subjects m raters. |
correct |
a logical indicating whether the coefficient should be corrected for ties within raters. |
Missing data are omitted in a listwise way.
Kendall's W should be corrected for ties if raters did not use a true ranking order for the subjects.
A test for the significance of Kendall's W is only valid for large samples.
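A minimal base R sketch of the uncorrected coefficient and its large-sample chi-squared test follows, using the textbook formulas W = 12 S / (m^2 (n^3 - n)) and chi-squared = m (n - 1) W with n - 1 degrees of freedom. The helper `kendall_w` is hypothetical and applies no correction for ties.

```r
# Sketch of Kendall's W without tie correction (base R only).
# kendall_w is a hypothetical helper, not part of the package.
kendall_w <- function(ratings) {
  ratings <- as.matrix(na.omit(ratings))
  n <- nrow(ratings); m <- ncol(ratings)
  ranks <- apply(ratings, 2, rank)        # rank the subjects within each rater
  R <- rowSums(ranks)                     # rank sum per subject
  S <- sum((R - mean(R))^2)               # squared deviations of the rank sums
  W <- 12 * S / (m^2 * (n^3 - n))
  chisq <- m * (n - 1) * W                # large-sample approximation only
  list(W = W, chisq = chisq, df = n - 1,
       p.value = pchisq(chisq, df = n - 1, lower.tail = FALSE))
}

# Two raters rank identically, the third reverses the order
ratings <- cbind(c(1, 2, 3, 4), c(1, 2, 3, 4), c(4, 3, 2, 1))
res <- kendall_w(ratings)
res$W       # 12 * 5 / (9 * 60) = 1/9
res$chisq   # 3 * 3 * (1/9) = 1
```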
A list with class '"irrlist"' containing the following components:
$method |
a character string describing the method applied for the computation of interrater reliability. |
$subjects |
the number of subjects examined. |
$raters |
the number of raters. |
$irr.name |
a character string specifying the name of the coefficient. |
$value |
coefficient of interrater reliability. |
$stat.name |
a character string specifying the name and the df of the corresponding chi-squared test. |
$statistic |
the value of the test statistic. |
$p.value |
the p-value for the test. |
$error |
the character string of a warning message if ties were found within raters. |
Matthias Gamer
Kendall, M.G. (1948). Rank correlation methods. London: Griffin.
data(anxiety)
kendall(anxiety, TRUE)
Calculates the alpha coefficient of reliability proposed by Krippendorff.
kripp.alpha(x, method=c("nominal","ordinal","interval","ratio"))
x |
classifier x object matrix of classifications or scores |
method |
data level of x |
A list with class '"irrlist"' containing the following components:
$method |
a character string describing the method. |
$subjects |
the number of data objects. |
$raters |
the number of raters. |
$irr.name |
a character string specifying the name of the coefficient. |
$value |
value of alpha. |
$stat.name |
here "nil" as there is no test statistic. |
$statistic |
the value of the test statistic (NULL). |
$p.value |
the probability of the test statistic (NULL). |
cm |
the concordance/discordance matrix used in the calculation of alpha |
data.values |
a character vector of the unique data values |
levx |
the unique values of the ratings |
nmatchval |
the count of matches, used in calculation |
data.level |
the data level of the ratings ("nominal","ordinal", "interval","ratio") |
Krippendorff's alpha coefficient is particularly useful where the level of measurement of classification data is higher than nominal or ordinal.
Jim Lemon
Krippendorff, K. (1980). Content analysis: An introduction to its methodology. Beverly Hills, CA: Sage.
# the "C" data from Krippendorff
nmm<-matrix(c(1,1,NA,1,2,2,3,2,3,3,3,3,3,3,3,3,2,2,2,2,1,2,3,4,4,4,4,4,
              1,1,2,1,2,2,2,2,NA,5,5,5,NA,NA,1,1,NA,NA,3,NA),nrow=4)
# first assume the default nominal classification
kripp.alpha(nmm)
# now use the same data with the other three methods
kripp.alpha(nmm,"ordinal")
kripp.alpha(nmm,"interval")
kripp.alpha(nmm,"ratio")
Computes Maxwell's RE as an index of the interrater agreement of binary data.
maxwell(ratings)
ratings |
n*2 matrix or dataframe, n subjects 2 raters. |
Missing data are omitted in a listwise way.
A list with class '"irrlist"' containing the following components:
$method |
a character string describing the method applied for the computation of interrater reliability. |
$subjects |
the number of subjects examined. |
$raters |
the number of raters (=2). |
$irr.name |
a character string specifying the name of the coefficient. |
$value |
value of RE. |
Matthias Gamer
Maxwell, A.E. (1977). Coefficients of agreement between observers and their interpretation. British Journal of Psychiatry, 130, 79-83.
data(anxiety)
# Median-split to generate binary data
r1 <- ifelse(anxiety$rater1<median(anxiety$rater1),0,1)
r2 <- ifelse(anxiety$rater2<median(anxiety$rater2),0,1)
maxwell(cbind(r1,r2))
Computes the mean of bivariate Pearson's product moment correlations between raters as an index of the interrater reliability of quantitative data.
meancor(ratings, fisher = TRUE)
ratings |
n*m matrix or dataframe, n subjects m raters. |
fisher |
a logical indicating whether the correlation coefficients should be Fisher z-standardized before averaging. |
Missing data are omitted in a listwise way.
The mean of bivariate correlations should not be used as an index of interrater reliability when the variance of ratings differs between raters.
The null hypothesis r=0 can only be tested when Fisher z-standardized values are used for the averaging.
When computing Fisher z-standardized values, perfect correlations are omitted before averaging because z equals +/-Inf in that case.
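The averaging step can be sketched in base R: pairwise correlations are transformed with atanh (Fisher's z), averaged, and transformed back with tanh. The helper `mean_cor` is hypothetical and omits the test statistic. It also assumes a small numerical tolerance when deciding whether a correlation is perfect, which is a choice made for this sketch only.

```r
# Sketch of the mean pairwise correlation with Fisher z averaging (base R only).
# mean_cor is a hypothetical helper, not part of the package.
mean_cor <- function(ratings, fisher = TRUE) {
  ratings <- as.matrix(na.omit(ratings))
  cm <- cor(ratings)
  r  <- cm[upper.tri(cm)]                  # one value per pair of raters
  if (!fisher) return(mean(r))
  # Drop (numerically) perfect correlations, whose z would be +/-Inf.
  # The tolerance is an assumption of this sketch.
  r <- r[abs(r) < 1 - sqrt(.Machine$double.eps)]
  tanh(mean(atanh(r)))                     # average on the z scale, back-transform
}

x <- cbind(rater1 = c(1, 2, 3, 4, 5),
           rater2 = c(1, 2, 3, 4, 5),      # identical to rater1: r = 1 is dropped
           rater3 = c(2, 1, 4, 3, 5))      # correlates 0.8 with the other two
mean_cor(x)                  # 0.8
mean_cor(x, fisher = FALSE)  # (1 + 0.8 + 0.8) / 3
```

Dropping the perfect correlation lowers the Fisher-based average relative to the plain mean, so it is worth reporting when correlations were omitted.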
A list with class '"irrlist"' containing the following components:
$method |
a character string describing the method applied for the computation of interrater reliability. |
$subjects |
the number of subjects examined. |
$raters |
the number of raters. |
$irr.name |
a character string specifying the name of the coefficient. |
$value |
coefficient of interrater reliability. |
$stat.name |
a character string specifying the name of the corresponding test statistic. |
$statistic |
the value of the test statistic. |
$p.value |
the p-value for the test. |
$error |
a character string specifying whether correlations were dropped before the computation of the Fisher z-standardized average. |
Matthias Gamer
data(anxiety)
meancor(anxiety)
Computes the mean of bivariate Spearman's rho rank correlations between raters as an index of the interrater reliability of ordinal data.
meanrho(ratings, fisher = TRUE)
ratings |
n*m matrix or dataframe, n subjects m raters. |
fisher |
a logical indicating whether the correlation coefficients should be Fisher z-standardized before averaging. |
Missing data are omitted in a listwise way.
The mean of bivariate rank correlations should not be used as an index of interrater reliability when ties within raters occur.
The null hypothesis r=0 can only be tested when Fisher z-standardized values are used for the averaging.
When computing Fisher z-standardized values, perfect correlations are omitted before averaging because z equals +/-Inf in that case.
A list with class '"irrlist"' containing the following components:
$method |
a character string describing the method applied for the computation of interrater reliability. |
$subjects |
the number of subjects examined. |
$raters |
the number of raters. |
$irr.name |
a character string specifying the name of the coefficient. |
$value |
coefficient of interrater reliability. |
$stat.name |
a character string specifying the name of the corresponding test statistic. |
$statistic |
the value of the test statistic. |
$p.value |
the p-value for the test. |
$error |
a character string specifying whether correlations were dropped before the computation of the Fisher z-standardized average. Additionally, a warning message is created if ties were found within raters. |
Matthias Gamer
data(anxiety)
meanrho(anxiety, TRUE)
This function is a sample size estimator for the Cohen's Kappa statistic for a binary outcome. Note that any value of "kappa under null" in the interval [0,1] is acceptable (i.e. k0=0 is a valid null hypothesis).
N.cohen.kappa(rate1, rate2, k1, k0, alpha=0.05, power=0.8, twosided=FALSE)
rate1 |
the probability that the first rater will record a positive diagnosis |
rate2 |
the probability that the second rater will record a positive diagnosis |
k1 |
the true Cohen's Kappa statistic |
k0 |
the value of kappa under the null hypothesis |
alpha |
type I error of test |
power |
the desired power to detect the difference between true kappa and hypothetical kappa |
twosided |
TRUE if test is two-sided |
Returns required sample size.
Ian Fellows
Cantor, A. B. (1996) Sample-size calculation for Cohen's kappa. Psychological Methods, 1, 150-153.
# Testing H0: kappa = 0.7 vs. HA: kappa > 0.7 given that
# kappa = 0.85 and both raters classify 50% of subjects as positive.
# Arguments are named so that k1 is the true kappa and k0 the null value.
N.cohen.kappa(0.5, 0.5, k1=0.85, k0=0.7)
This function calculates the required sample size for the Cohen's Kappa statistic when two raters have the same marginal. Note that any value of "kappa under null" in the interval [-1,1] is acceptable (i.e. k0=0 is a valid null hypothesis).
N2.cohen.kappa(mrg, k1, k0, alpha=0.05, power=0.8, twosided=FALSE)
mrg |
a vector of marginal probabilities given by raters |
k1 |
the true Cohen's Kappa statistic |
k0 |
the value of kappa under the null hypothesis |
alpha |
type I error of test |
power |
the desired power to detect the difference between true kappa and hypothetical kappa |
twosided |
TRUE if test is two-sided |
Returns required sample size.
Puspendra Singh and Jim Lemon
Flack, V.F., Afifi, A.A., Lachenbruch, P.A., & Schouten, H.J.A. (1988). Sample size determinations for the two rater kappa statistic. Psychometrika, 53, 321-325.
require(lpSolve)
# Testing H0: kappa = 0.4 vs. HA: kappa > 0.4 (=0.6) given that
# Marginal Probabilities by two raters are (0.2, 0.25, 0.55).
#
# one sided test with 80% power:
N2.cohen.kappa(c(0.2, 0.25, 0.55), k1=0.6, k0=0.4)
# one sided test with 90% power:
N2.cohen.kappa(c(0.2, 0.25, 0.55), k1=0.6, k0=0.4, power=0.9)

# Testing H0: kappa = 0.1 vs. HA: kappa > 0.1 (=0.5) given that
# Marginal Probabilities by two raters are (0.2, 0.05, 0.2, 0.05, 0.2, 0.3)
#
# one sided test with 80% power:
N2.cohen.kappa(c(0.2, 0.05, 0.2, 0.05, 0.2, 0.3), k1=0.5, k0=0.1)
Prints the results of the ICC computation.
## S3 method for class 'icclist'
print(x, ...)
x |
a list with class '"icclist"' containing the results of the ICC computation. |
... |
further arguments passed to or from other methods. |
'"print.icclist"' is only a printing function and is usually not called directly.
Matthias Gamer
data(anxiety)
# "print.icclist" is the default printing function of "icc"
icc(anxiety, model="twoway", type="agreement")
Prints the results of various functions computing coefficients of interrater reliability.
## S3 method for class 'irrlist'
print(x, ...)
x |
a list with class '"irrlist"' containing the results of the interrater reliability computation. |
... |
further arguments passed to or from other methods. |
'"print.irrlist"' is only a printing function and is usually not called directly.
Matthias Gamer
bhapkar, finn, iota, kappa2, kappam.fleiss, kappam.light, kripp.alpha, kendall, maxwell, meancor, meanrho, rater.bias, robinson, stuart.maxwell
data(anxiety)
# "print.irrlist" is the default printing method of various functions, e.g.
finn(anxiety, 6)
meancor(anxiety)
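The printing behaviour described above relies on ordinary S3 dispatch: a result object carries a class attribute, and printing it at the top level calls the matching print method. The following is a minimal sketch of that mechanism only. The class name "demolist", the object demo_result and the method print.demolist are hypothetical and are not part of the package; the list components merely mimic the component names documented for '"irrlist"' objects.

```r
# Hypothetical result object; "demolist" is an illustrative class name,
# not a class defined by the irr package.
demo_result <- structure(
  list(method = "Demo coefficient", subjects = 20, raters = 3,
       irr.name = "demo", value = 0.5),
  class = "demolist"
)

# S3 print method: print() dispatches to print.<class> for classed objects.
print.demolist <- function(x, ...) {
  cat(" ", x$method, "\n\n", sep = "")
  cat(" Subjects = ", x$subjects, "\n", sep = "")
  cat("   Raters = ", x$raters, "\n", sep = "")
  cat(" ", x$irr.name, " = ", format(x$value), "\n", sep = "")
  invisible(x)  # return the object invisibly, as print methods usually do
}

demo_result         # autoprinting at top level dispatches to print.demolist()
print(demo_result)  # explicit call, same output
```

This is why a print method of this kind rarely needs to be called by name: evaluating the result of the computing function at the prompt is enough.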
Calculates a coefficient of systematic bias between two raters.
rater.bias(x)
x | c x c classification matrix or 2 x n or n x 2 matrix of classification scores into c categories. |
rater.bias calculates a reliability coefficient for two raters
classifying n objects into any number of categories. It will accept either
a c x c classification matrix of counts of objects falling into c categories
or a 2 x n or n x 2 matrix of classification scores.
The function returns the absolute value of the triangular off-diagonal
sum ratio of the c x c classification table and the corresponding test statistic.
A systematic bias between raters can be assumed when the ratio substantially
deviates from 0.5 while yielding a significant Chi-squared statistic.
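To make the idea of a triangular off-diagonal sum ratio concrete, here is a minimal sketch for a c x c table of counts. It rests on assumptions: the function name rater_bias_sketch is hypothetical, the choice of the lower triangle as numerator is arbitrary, and the McNemar-type statistic with 1 degree of freedom is one plausible reading of the description above, not necessarily the exact computation performed by rater.bias.

```r
# Sketch only: ratio of one off-diagonal triangle to all off-diagonal counts,
# plus an assumed McNemar-type Chi-squared statistic with 1 df.
rater_bias_sketch <- function(tab) {
  tab <- as.matrix(tab)
  stopifnot(nrow(tab) == ncol(tab))
  lower <- sum(tab[lower.tri(tab)])  # cases where rater 1 chose the higher category
  upper <- sum(tab[upper.tri(tab)])  # cases where rater 2 chose the higher category
  ratio <- lower / (lower + upper)   # 0.5 means no systematic bias
  stat  <- (lower - upper)^2 / (lower + upper)
  list(ratio = ratio,
       statistic = stat,
       p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}

# Made-up 3 x 3 classification table (rows: rater 1, columns: rater 2)
tab <- matrix(c(10,  2,  3,
                 6, 10,  1,
                 7,  5, 10), nrow = 3, byrow = TRUE)
res <- rater_bias_sketch(tab)
res$ratio      # 18 / (18 + 6) = 0.75
res$statistic  # (18 - 6)^2 / 24 = 6
```

In this made-up table the disagreements fall mostly below the diagonal, so the ratio moves away from 0.5 and the statistic is large relative to a Chi-squared distribution with 1 degree of freedom.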
method | Name of the method |
subjects | Number of subjects |
raters | Number of raters (2) |
irr.name | Name of the coefficient: ratio of triangular off-diagonal sums |
value | Value of the coefficient |
stat.name | Name of the test statistic |
statistic | Value of the test statistic |
p.value | p-value of the test statistic (Chi-squared distribution with 1 degree of freedom) |
Jim Lemon
Bishop Y.M.M., Fienberg S.E., & Holland P.W. (1978). Discrete multivariate analysis: theory and practice. Cambridge, Massachusetts: MIT Press.
# fake a 2xn matrix of three way classification scores
ratings <- matrix(sample(1:3,60,TRUE), nrow=2)
rater.bias(ratings)

# Example from Bishop, Fienberg & Holland (1978), Table 8.2-1
data(vision)
rater.bias(vision)
‘relInterIntra’ calculates inter- and intra-rater reliability coefficients.
relInterIntra(x, nrater=1, raterLabels=NULL, rho0inter=0.6, rho0intra=0.8, conf.level=.95)
x | Data frame or matrix of rater by object scores |
nrater | Number of raters |
raterLabels | Labels for the raters or methods |
rho0inter | Null hypothesis value for the inter-rater reliability coefficient |
rho0intra | Null hypothesis value for the intra-rater reliability coefficient |
conf.level | Confidence level for the one-sided confidence interval reported |
nil
Tore Wentzel-Larsen
Eliasziw, M., Young, S.L., Woodbury, M.G., & Fryday-Field, K. (1994). Statistical methodology for the concurrent assessment of interrater and intrarater reliability: Using goniometric measurements as an example. Physical Therapy, 74, 777-788.
# testing code for the Goniometer data from the article:
table4<-matrix(c(
-2,16,5,11,7,-7,18,4,0,0,-3,3,7,-6,1,-13,2,4,-10,8,7,-3,-5,5,0,7,-8,1,-3,
0,16,6,10,8,-8,19,5,-3,0,-2,-1,9,-7,1,-14,1,4,-9,9,6,-2,-5,5,-1,6,-8,1,-3,
1,15,6,10,6,-8,19,5,-2,-2,-2,1,9,-6,0,-14,0,3,-10,8,7,-4,-7,5,-1,6,-8,2,-3,
2,12,4,9,5,-9,17,5,-7,1,-4,-1,4,-8,-2,-12,-1,7,-10,2,8,-5,-6,3,-4,4,-10,1,-5,
1,14,4,7,6,-10,17,5,-6,2,-3,-2,4,-10,-2,-12,0,6,-11,8,7,-5,-8,4,-3,4,-11,-1,-4,
1,13,4,8,6,-9,17,5,-5,1,-3,1,2,-9,-3,-12,0,4,-10,8,7,-5,-7,4,-4,4,-10,0,-5
),ncol=6)
relInterIntra(x=table4,nrater=2,raterLabels=c('universal','Lamoreux'))
Computes Robinson's A as an index of the interrater reliability of quantitative data.
robinson(ratings)
ratings | n*m matrix or dataframe, n subjects m raters. |
Missing data are omitted in a listwise way.
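As a rough illustration of what the coefficient measures, Robinson's A can be read as the share of the total variation in the ratings that is due to differences between subjects, A = 1 - D/D_max in the notation of Robinson (1957). The sketch below follows that definition with listwise deletion of missing rows. The function name robinson_a_sketch is hypothetical, and the sketch is not guaranteed to reproduce the package's internal computation step by step.

```r
# Sketch only: Robinson's A as between-subject sum of squares divided by
# the total sum of squares around the grand mean.
robinson_a_sketch <- function(ratings) {
  x <- as.matrix(na.omit(ratings))   # listwise deletion of incomplete subjects
  k <- ncol(x)                       # number of raters
  grand <- mean(x)                   # grand mean over all ratings
  ss_total   <- sum((x - grand)^2)
  ss_between <- k * sum((rowMeans(x) - grand)^2)
  ss_between / ss_total
}

# Two raters in perfect agreement: A = 1
robinson_a_sketch(cbind(1:5, 1:5))
# Second rater reverses the order of the first: every subject mean equals
# the grand mean, so A = 0
robinson_a_sketch(cbind(1:5, 5:1))
```

Unlike a correlation, this index ranges from 0 to 1, with 0 reached only when raters disagree as strongly as the data allow.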
A list with class '"irrlist"' containing the following components:
$method | a character string describing the method applied for the computation of interrater reliability. |
$subjects | the number of subjects examined. |
$raters | the number of raters. |
$irr.name | a character string specifying the name of the coefficient. |
$value | coefficient of interrater reliability. |
Matthias Gamer
Robinson, W.S. (1957). The statistical measurement of agreement. American Sociological Review, 22, 17-25.
data(anxiety)
robinson(anxiety)
Calculates the Stuart-Maxwell coefficient of concordance for two raters.
stuart.maxwell.mh(x)
x | c x c classification matrix or matrix of classification scores into c categories. |
stuart.maxwell.mh calculates a reliability coefficient for two raters
classifying n objects into any number of categories. It will accept either
a c x c classification matrix of counts of objects falling into c categories
or a 2 x n or n x 2 matrix of classification scores.
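For readers who want to see the arithmetic behind a test of marginal homogeneity on a c x c table, the following sketch implements the textbook Stuart-Maxwell statistic: the differences between row and column totals for the first c-1 categories, weighted by the inverse of their estimated covariance matrix, referred to a Chi-squared distribution with c-1 degrees of freedom. The function name stuart_maxwell_sketch is hypothetical, and the sketch is an illustration under these assumptions rather than the package's own code.

```r
# Sketch only: textbook Stuart-Maxwell statistic for a square table of counts.
stuart_maxwell_sketch <- function(tab) {
  tab <- as.matrix(tab)
  stopifnot(nrow(tab) == ncol(tab))
  k <- nrow(tab)
  # differences in marginal totals; the last category is dropped (redundant)
  d <- (rowSums(tab) - colSums(tab))[-k]
  # covariance matrix of the differences
  S <- -(tab + t(tab))
  diag(S) <- rowSums(tab) + colSums(tab) - 2 * diag(tab)
  S <- S[-k, -k, drop = FALSE]
  stat <- as.numeric(t(d) %*% solve(S) %*% d)
  list(statistic = stat, df = k - 1,
       p.value = pchisq(stat, df = k - 1, lower.tail = FALSE))
}

# For a 2 x 2 table the statistic reduces to McNemar's test without
# continuity correction: (15 - 5)^2 / (15 + 5) = 5
tab2 <- matrix(c(10, 5, 15, 20), nrow = 2)
sm2 <- stuart_maxwell_sketch(tab2)
sm2$statistic
```

A table with identical row and column totals gives a statistic of zero, which corresponds to perfectly homogeneous marginal distributions.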
A list with class "irrlist" containing the following components:
$method | a character string describing the method. |
$subjects | the number of data objects. |
$raters | the number of raters. |
$irr.name | the name of the coefficient (Chisq). |
$value | the value of the coefficient. |
$stat.name | the name and df of the test statistic. |
$statistic | the value of the test statistic. |
$p.value | the probability of the test statistic. |
Jim Lemon
Stuart, A.A. (1955). A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42, 412-416.
Maxwell, A.E. (1970) Comparing the classification of subjects by two independent judges. British Journal of Psychiatry, 116, 651-655.
# fake a 2xn matrix of three way classification scores
ratings<-matrix(sample(1:3,60,TRUE), nrow=2)
stuart.maxwell.mh(ratings)

# Example used from Stuart (1955)
data(vision)
stuart.maxwell.mh(vision)
The data frame contains the credibility ratings of 20 subjects, rated by 4 raters. Judgements could vary from 1 (not credible) to 6 (highly credible). Variance between and within raters is low.
data(video)
A data frame with 20 observations on the following 4 variables.
ratings of rater 1
ratings of rater 2
ratings of rater 3
ratings of rater 4
artificial data
data(video)
apply(video,2,table)
Case records of the eye-testing of N=7477 female employees in Royal Ordnance factories between 1943 and 1946. Data were primarily used by Stuart (1953) to illustrate the estimation and comparison of strengths of association in contingency tables.
data(vision)
A data frame with 7477 observations (eye testing results with levels 1st grade, 2nd grade, 3rd grade, 4th Grade) on the following 2 variables.
unaided distance vision performance of the right eye
unaided distance vision performance of the left eye
Stuart, A. (1953). The Estimation and Comparison of Strengths of Association in Contingency Tables. Biometrika, 40, 105-110.
data(vision)
table(vision$r.eye, vision$l.eye)