From the
dependency network of all CRAN packages, we have seen that reverse
dependencies of all CRAN packages seem to follow the power law. This
echoes the phenomenon in other software dependencies observed by, for
example, LaBelle and
Wallingford, 2004, Baxter et al., 2006,
Jenkins and Kirk,
2007, Wu et
al. 2007, Louridas
et al., 2008, Zheng et al.,
2008, Kohring,
2009, Li et
al., 2013, Bavota et al. 2015,
and Cox et al.,
2015. In this vignette, we will fit the discrete power law to model
this number of reverse dependencies, using functions with the suffix
upp
, namely dupp()
, Supp()
and
mcmc_upp()
. While we shall focus on “Imports”, which is one
of the serveral kinds of dependencies in R, the same analysis can be
carried out for all other types.
In the vignette on all CRAN pacakges, we
looked at the dependency type Depends
. Here, we look at the
dependency type Imports
, using the same workflow. As we are
using the forward dependencies to construct the network, the number of
reverse dependencies is equivalent to the in-degree of a package, which
can be obtained via the function igraph::degree()
. We then
construct a data frame to hold this in-degree information.
g0.imports <- get_graph_all_packages(type = "imports")
d0.imports <- g0.imports |> igraph::degree(mode = "in")
df0.imports <-
data.frame(name = names(d0.imports), degree = as.integer(d0.imports)) |>
dplyr::arrange(dplyr::desc(degree), name)
head(df0.imports, 10)
#> name degree
#> 1 dplyr 3676
#> 2 ggplot2 3458
#> 3 Rcpp 2693
#> 4 rlang 2499
#> 5 magrittr 2082
#> 6 stringr 1866
#> 7 tidyr 1844
#> 8 tibble 1761
#> 9 purrr 1684
#> 10 MASS 1641
For the purpose of verification, we use the reverse dependencies to construct the network, then look at the out-degrees of the packages this time.
g0.rev_imports <- get_graph_all_packages(type = "imports", reverse = TRUE)
d0.rev_imports <- g0.rev_imports |> igraph::degree(mode = "out") # note the difference to above
df0.rev_imports <-
data.frame(name = names(d0.rev_imports), degree = as.integer(d0.rev_imports)) |>
dplyr::arrange(dplyr::desc(degree), name)
head(df0.rev_imports, 10)
#> name degree
#> 1 dplyr 3676
#> 2 ggplot2 3458
#> 3 Rcpp 2693
#> 4 rlang 2499
#> 5 magrittr 2082
#> 6 stringr 1866
#> 7 tidyr 1844
#> 8 tibble 1761
#> 9 purrr 1684
#> 10 MASS 1641
Theoretically, the two data frames are identical. Any possible (but small) difference is due to the CRAN pages being updated while scraping.
We construct a data frame for the empirical frequencies and survival function at the whole range of data.
df1.imports <- df0.imports |>
dplyr::filter(degree > 0L) |> # to prevent warning when plotting on log-log scale
dplyr::count(degree, name = "frequency") |>
dplyr::arrange(degree) |>
dplyr::mutate(survival = 1.0 - cumsum(frequency) / sum(frequency))
Before fitting the discrete power law, to be determined first is a threshold above which it is appropriate. We will visualise the degree distribution to determine such threshold.
plot_log_log <- function(df) {
## useful function for below
df |>
ggplot2::ggplot() +
ggplot2::scale_x_log10() +
ggplot2::scale_y_log10() +
ggplot2::theme_bw(12)
}
gg0 <- df1.imports |>
plot_log_log() +
ggplot2::geom_point(aes(degree, frequency), size = 0.75) +
ggplot2::coord_cartesian(ylim = c(1L, 2e+3L))
gg0
The power law seems appropriate for the whole range of data, and so, for illustration purposes, the threshold will be set at 1 inclusive. 0’s will be excluded anyway because, as we will see, the probability mass function (PMF) is not well-defined at 0.
While it is straightforward to determine the threshold here, such linearity over the whole range might not be seen for other data. The package poweRlaw provides functions for and references to more systematic/objective procedures of selecting the threshold.
We use the function mcmc_pol()
to fit the discrete power
law, of which the PMF is proportional to x−α, where α is the lone scalar parameter.
The Bayesian approach is used here for inference, meaning that a prior has to be set for the parameters. We assume a normal distribution N(mα = 0, sα = 10) for α. Markov chain Monte Carlo (MCMC) is used as the inference algorithm.
x <- df1.imports$degree
counts <- df1.imports$frequency
alpha0 <- 1.5 # initial value
theta0 <- 1.0 # initial value, but fixed to 1.0 for discrete power law
m_alpha <- 0.0 # prior mean
s_alpha <- 10.0 # prior standard deviation
set.seed(3075L)
mcmc0.imports <-
mcmc_pol(
x = x,
count = counts,
alpha = alpha0,
theta = theta0,
a_alpha = m_alpha,
b_alpha = s_alpha,
a_theta = 1.0,
b_theta = 1.0,
a_pseudo = 10.0,
b_pseudo = 1.0,
pr_power = 1.0,
iter = 2500L,
thin = 1L,
burn = 500L,
freq = 500L,
invt = 1.0,
mc3_or_marg = TRUE,
x_max = 100000
) # takes seconds
Now we have the samples representing the posterior distribution of α:
This means the number of reverse “Imports” follows approximately a
power law with exponent 1.65. We can also obtain the posterior density
via numerical integration using marg_pol()
. The returned
list contains log.marginal
for the marginal log-likelihood,
and posterior
for the posterior density in a data frame. We
verify its alignment with the MCMC results:
marg0.imports <-
marg_pow(
data.frame(x = x, count = counts),
lower = 1.001,
upper = 20.0,
m_alpha = m_alpha,
s_alpha = s_alpha
)
ggplot2::last_plot() +
ggplot2::geom_line(ggplot2::aes(alpha, density), marg0.imports$posterior, col = 2) +
ggplot2::coord_cartesian(xlim = range(mcmc0.imports$pars$alpha))
We can also calculate the fitted frequencies and survival function,
using dpol()
and Spol()
respectively. Their
summaries are however readily available in the MCMC output. We overlay
the fitted line (blue, dashed) and credible intervals (red, dotted) on
the plot above to check goodness-of-fit:
n0 <- sum(df1.imports$frequency) # TOTAL number of data points in x
## or n0 <- length(x)
df0.fitted <- mcmc0.imports$fitted
gg1 <- df1.imports |>
plot_log_log() +
ggplot2::geom_point(aes(degree, frequency), size = 0.75) +
ggplot2::geom_line(aes(x, f_med * n0), data = df0.fitted, col = 4, lty = 2) +
ggplot2::geom_line(aes(x, f_025 * n0), data = df0.fitted, col = 2, lty = 3) +
ggplot2::geom_line(aes(x, f_975 * n0), data = df0.fitted, col = 2, lty = 3) +
ggplot2::coord_cartesian(ylim = c(1L, 2e+3L))
gg1
The corresponding survival function according to the power law fit can also be obtained:
gg2 <- df1.imports |>
plot_log_log() +
ggplot2::geom_point(aes(degree, survival), size = 0.75) +
ggplot2::geom_line(aes(x, S_med), data = df0.fitted, col = 4, lty = 2) +
ggplot2::geom_line(aes(x, S_025), data = df0.fitted, col = 2, lty = 3) +
ggplot2::geom_line(aes(x, S_975), data = df0.fitted, col = 2, lty = 3)
gg2
#> Warning in ggplot2::scale_y_log10(): log-10 transformation introduced infinite
#> values.
This shows the discrete power law doesn’t fit as good as it seems
according to the frequency plot. To improve the fit, we use the discrete
extreme value mixture distributions introduced in Lee, Eastoe and Farrell,
2024. Provided in this package for such purpose are
dmix2()
, Smix2()
and mcmc_mix2()
.
As it takes hours to run the inference algorithm using
mcmc_mix()
, we hardcode the parameter values based on their
component-wise posterior median, and overlay the data by the fitted
line.
gg3 <- df1.imports |>
filter(degree > 1L) |>
dplyr::mutate(
survival = 1.0 - cumsum(frequency) / sum(frequency),
survival.mix = Smix2(degree, 1606, 1.73, 1.00, 0.237, 4.03, 0.003)
) |>
ggplot2::ggplot() +
ggplot2::geom_point(aes(degree, survival), size = 0.75) +
ggplot2::geom_line(aes(degree, survival.mix), col = 4, lty = 2, lwd = 1.2) +
ggplot2::scale_x_log10() +
ggplot2::scale_y_log10() +
ggplot2::theme_bw(12)
gg3
#> Warning in ggplot2::scale_y_log10(): log-10 transformation introduced infinite
#> values.
This illustrates the mixture distribution’s better fit.