An important statistical question deals with the measurement and
testing of an association between two paired continuous measures. The
product-moment correlation r is a widely used statistic to
examine bivariate associations. However, the product-moment correlation
depends on Gaussian assumptions for statistical tests. It is also not a
robust measure because it is strongly influenced by any extreme outlier
score for either of the two variates. A rank-based statistic can avoid
both the problem of outlier sensitivity and the problem of being
dependent upon the Gaussian model. The
dfba_bivariate_concordance()
function provides a Bayesian
distribution-free concordance metric for characterizing the association
between the two measures.
To illustrate the nonparametric concepts of concordance and discordance, consider a specific example where there are five paired scores:
x | y |
---|---|
3.8 | 5.9 |
4.7 | -4.1 |
4.7 | 7.3 |
4.7 | 7.3 |
11.8 | 38.9 |
The ranks for the x variate are 1, 3, 3, 3, and 5 and the corresponding ranks for y are 2, 1, 3.5, 3.5, and 5, so, the five pairs in terms of their ranks and represented as points are P1 = (1, 2), P2 = (3, 1), P3 = (3, 3.5), P4 = (3, 3.5) and P5 = (5, 5). Let Rxi and Ryi be the respective rank values for the x and y variates for point i. The relationship between any two of these points Pi and Pj, is either (1) concordant if the sign of Rxi − Rxj is the same as the sign of Ryi − Ryj, (2) discordant if signs are different between Rxi − Rxj and Ryi − Ryj, or (3) null if either Rxi = Rxj or if Ryi = Ryj. For this example, there are ten possible comparisons among the five points; six are concordant, one is discordant, and there are three comparisons lost due to ties. In general, given n bivariate scores, there are n(n − 1)/2 total possible comparisons. When there are ties in the x variate, there is a loss of Tx comparisons, and when there are ties in the y variate, there are a separate Ty lost comparisons. Ties in both x and y are denoted as Txy. The total number of possible comparisons, accounting for ties, is therefore: n(n − 1)/2 − Tx − Ty + Txy, where Txy is added to avoid double-counting of lost comparisons. For the example above, there are three lost comparisons due to ties in x (i.e., Tx = 3), one lost comparison due to a tie in y (i.e., Ty = 1), and one comparison lost to a tie in both the x and y variates (i.e., Txy = 1). Thus, there are [(5 * 4)/2] − 3 − 1 + 1 = 7 valid comparisons. The τA correlation is defined as (nc − nd)/(nc + nd), which is a value on the [−1, 1] interval. This coefficient is also called the Kendall tau-A (τA) correlation. It is important to note that Kendall also used a different coefficient that has come to be called tau-B (τB). The tau-B correlation is defined as:
Unfortunately, the τB formula does
not properly correct for tied scores, which is regrettable because τB is the value
returned using the cor()
and cor.test()
functions from the stats
package using the
method = "kendall"
argument (i.e.,
cor(x, y, method = "kendall")
;
cor.test(x, y, method = "kendall")
). If there are no ties,
then Tx = Ty = Txy = 0,
and τA = τB,
but if there are ties, then the coefficient that properly
corrects for ties is τA. The
dfba_bivariate_concordance()
function provides the proper
correction for tied scores and outputs a sample estimate for the
frequentist τA rather than
τB.
The focus for the Bayesian analysis is on the population proportion
of concordance, which is the limit of the ratio nc/(nc + nd).
This proportion is a value on the [0, 1] interval, and it is called ϕ. The ϕ parameter is also connected to the
population τA because τA = 2ϕ − 1.
Moreover, Chechile (2020) showed that the likelihood function for
observing nc concordant
changes and nd discordant
changes is a censored Bernoulli process because order is a
property that must satisfy a transitivity requirement. Therefore, the
likelihood given a value for ϕ
is Kcϕnc(1 − ϕ)nd
where Kc
is the number of transitive arrangements with nc concordant
comparisons and nd discordant
comparisons. In Bayesian statistics, the likelihood function is only
specified as a proportional function because the number of possible
arrangements for observing nc concordance
changes and nd discordance
changes cancel out in Bayes theorem (i.e., the number Kc is in both
the numerator and the denominator of Bayes theorem, so it cancels). If
the prior for ϕ is a beta
distribution, then it follows that the posterior is also a beta
distribution (i.e., the beta is a natural Bayesian conjugate
function for Bernoulli processes). The default prior for the
dfba_bivariate_concordance()
function is the flat prior: a
beta distribution with shape parameters a0 = 1 and b0 = 1.
dfba_bivariate_concordance()
FunctionThe dfba_bivariate_concordance()
function has two
required arguments – x
and y
– that are
two paired vectors. Because these vectors are paired, each
score for x[i]
is linked with the corresponding score for
y[i]
.
The dfba_bivariate_concordance()
function also has four
optional arguments; listed with their respective default values, they
are: a0 = 1
, b0 = 1
,
prob_interval = .95
, and
fitting.parameters = NULL
. The arguments a0
and b0
represent the shape parameters (a0 and b0) for the prior beta
distribution to be assumed for the Bayesian analysis of the population
ϕ concordance proportion; the
default value of 1 for both of these
parameters corresponds to a uniform prior distribution (as noted above). Another optional argument is
prob_interval()
. This input allows the user to set the
proportion used for the interval estimate for the ϕ parameter; the default value is
.95. The last optional argument is
fitting_parameters()
, which has a default of
NULL
. This argument is only used when the user is
attempting to fit a mathematical model to the continuous univariate
problem.
An example for this type of problem will be examined later, but first let us see the results from a more typical bivariate association analysis.
x <- c(47, 39, 47, 42, 44, 46, 39, 37, 29, 42, 54, 33, 44, 31, 28, 49, 32, 37, 46, 55, 31)
y <- c(36, 40, 49, 45, 30, 38, 39, 44, 27, 48, 49, 51, 27, 36, 30, 44, 42, 41, 35, 49, 33)
A <- dfba_bivariate_concordance(x,
y)
A
#> Descriptive Statistics
#> ========================
#> Concordant Pairs Discordant Pairs
#> 128 68
#> Proportion of Concordant Pairs
#> 0.6530612
#>
#> Frequentist Analyses
#> ========================
#> Tau_A
#> 0.3061224
#>
#> Bayesian Analyses
#> ========================
#> Posterior Beta Shape Parameters for the Phi Concordance Measure
#> a_post b_post
#> 129 69
#> Posterior Median
#> 0.6520263
#> 95% Equal-tail interval limits:
#> Lower Limit Upper Limit
#> 0.583946 0.7161852
In the special case where the user has a model for predicting a
variate in terms of known quantities and where there are free-fitting
parameters, the dfba_bivariate_concordance()
function can
provide a distribution-free measure of the goodness-of-fit of the
scientific model. For this type of application, the bivariate pair are
the observed values of a variate along with the corresponding
predicted values from the scientific model. The concordance
proportion must be adjusted in these goodness-of-fit applications to
take into account the number of free parameters that were used in the
prediction model. Chechile and Barch (2022) argued that the fitting
parameters increase the number of concordant changes.
Consequently, the value for n_c
is
downward-adjusted as a function of the number of free
parameters. The Chechile-Barch adjusted n_c
value for a
case where there are m free
fitting parameters is nc − (n * m) + [m * (m + 1)/2].
As an example, suppose that there are n = 20 scores, and the prediction
equation has m = 2 free
parameters that result in creating a prediction for each observed score
(i.e., there are 20 paired
values of observed score x and
predicted score y), and
further suppose that this model results in nc = 170 and
nd = 20.
The value of nd is kept at
20, but the number of concordant
changes is reduced to 170 − (20 * 2) + (2 * 3/2) = 133.
# predicted values from model
p = seq(.05, .95, .05)
ypred= 17.332 - (50.261*p) + (48.308*p^2)
# Note: the coefficients in the ypred equation were found first via a polynomial regression
# observed values
yobs <- c(19.805, 10.105, 9.396, 8.219, 6.110, 4.543, 5.864, 4.861, 6.136, 5.789,
5.443, 5.548, 4.746, 6.484, 6.185, 6.202, 9.804, 9.332, 14.408)
B <- dfba_bivariate_concordance(x = yobs,
y = ypred,
fitting.parameters = 3)
B
#> Descriptive Statistics
#> ========================
#> Concordant Pairs Discordant Pairs
#> 142 29
#> Proportion of Concordant Pairs
#> 0.8304094
#>
#> Frequentist Analyses
#> ========================
#> Tau_A point estimate
#> 0.6608187
#>
#> Bayesian Analyses
#> ========================
#> Posterior Beta Shape Parameters for the Phi Concordance Measure
#> a_post b_post
#> 143 30
#> Posterior Median
#> 0.8278497
#> 95% Equal-tail interval limits:
#> Lower Limit Upper Limit
#> 0.766916 0.8791183
#>
#> Adjusted for number of model-fitting parameters
#> ------------------------
#> Beta Shape Parameters
#> a_post b_post
#> 92 30
#> Posterior Median
#> 0.7554904
#> 95% Equal-tail interval limits:
#> Lower Limit Upper Limit
#> 0.674262 0.8260471
Chechile, R.A. (2020). Bayesian Statistics for Experimental Scientists: A General Introduction Using Distribution-Free Methods. Cambridge: MIT Press.
Chechile, R. A., & Barch Jr., D.H. (2022). A distribution-free, Bayesian goodness-of-fit method for assessing similar scientific prediction equations. Journal of Mathematical Psychology, https://doi.org/10.1016/j.jmp.2021.102638