Title: | Algorithmic Complexity for Short Strings |
---|---|
Description: | Main functionality is to provide the algorithmic complexity for short strings, an approximation of the Kolmogorov Complexity of a short string using the coding theorem method (see ?acss). The database containing the complexity is provided in the data only package acss.data, this package provides functions accessing the data such as prob_random returning the posterior probability that a given string was produced by a random process. In addition, two traditional (but problematic) measures of complexity are also provided: entropy and change complexity. |
Authors: | Nicolas Gauvrit [aut], Henrik Singmann [aut, cre], Fernando Soler Toscano [ctb], Hector Zenil [ctb] |
Maintainer: | Henrik Singmann <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2-5 |
Built: | 2024-11-08 06:18:38 UTC |
Source: | CRAN |
Functions to obtain the algorithmic complexity for short strings, an approximation of the Kolmogorov Complexity of a short string using the coding theorem method.
acss(string, alphabet = 9) local_complexity(string, alphabet = 9, span = 5) likelihood_d(string, alphabet = 9) likelihood_ratio(string, alphabet = 9) prob_random(string, alphabet = 9, prior= 0.5)
acss(string, alphabet = 9) local_complexity(string, alphabet = 9, span = 5) likelihood_d(string, alphabet = 9) likelihood_ratio(string, alphabet = 9) prob_random(string, alphabet = 9, prior= 0.5)
string |
|
alphabet |
|
prior |
|
span |
size of substrings to be created from |
The algorithmic complexity is computed using the coding theorem method: For a given alphabet size (number of different symbols in a string), all possible or a large number of random samples of Turing machines (TM) with a given number of states (e.g., 5) and number of symbols corresponding to the alphabet size were simulated until they reached a halting state or failed to end.
The outputs of the TMs at the halting states produces a distribution of strings known as the algorithmic probability of the strings. The algorithmic coding theorem (Levin, 1974) establishes the connection between the complexity of a string and its algorithmic probability
as:
This package accesses a database containing data on 4.5 million strings from length 1 to 12 simulated on TMs with 2, 4, 5, 6, and 9 symbols.
For a more detailed discussion see Gauvrit, Singmann, Soler-Toscano, and Zenil (2014), http://complexitycalculator.com/methodology.html, or references below.
A matrix in which the rows correspond to the strings entered and the columns to the algorithmic complexity K and the algorithmic probability D of the string (see http://complexitycalculator.com/methodology.html).
A list with elements corresponding to the strings. Each list containes a named vector of algorithmic complexities (K) of all substrings in each string with length span.
A named vector with the likelihoods for string
given a detreministic process.
A named vector with the likelihood ratios (or Bayes factors) for string
given a random rather than detreministic process.
A named vector with the posterior probabilities that for a random process given the strings and the provided prior for being produced by a random process (default is 0.5, which correspond to a prior of 1 - 0.5 = 0.5 for a detereministic process).
The first time per session one of the functions described here is used, a relatively large dataset is loaded into memory which can take a considerable amount of time (> 10 seconds).
Delahaye, J.-P., & Zenil, H. (2012). Numerical evaluation of algorithmic complexity for short strings: A glance into the innermost structure of randomness. Applied Mathematics and Computation, 219(1), 63-77. doi:10.1016/j.amc.2011.10.006
Gauvrit, N., Singmann, H., Soler-Toscano, F., & Zenil, H. (2014). Algorithmic complexity for psychology: A user-friendly implementation of the coding theorem method. arXiv:1409.4080 [cs, stat]. http://arxiv.org/abs/1409.4080.
Gauvrit, N., Zenil, H., Delahaye, J.-P., & Soler-Toscano, F. (2014). Algorithmic complexity for short binary strings applied to psychology: a primer. Behavior Research Methods, 46(3), 732-744. doi:10.3758/s13428-013-0416-0
Levin, L. A. (1974). Laws of information conservation (nongrowth) and aspects of the foundation of probability theory. Problemy Peredachi Informatsii, 10(3), 30-35.
Soler-Toscano, F., Zenil, H., Delahaye, J.-P., & Gauvrit, N. (2012). Calculating Kolmogorov Complexity from the Output Frequency Distributions of Small Turing Machines. PLoS ONE, 9(5): e96223.
# WARNING: The first call to one of the functions # discussed on this page loads a large data set # and usually takes > 10 seconds. Stay patient. acss(c("HEHHEE", "GHHGGHGHH", "HSHSHHSHSS")) ## K.9 D.9 ## HEHHEE 23.38852 9.106564e-08 ## GHHGGHGHH 33.50168 8.222205e-11 ## HSHSHHSHSS 35.15241 2.618613e-11 acss(c("HEHHEE", "GHHGGHGHH", "HSHSHHSHSS"))[,"K.9"] ## [1] 23.38852 33.50168 35.15241 acss(c("HEHHEE", "GHHGGHGHH", "HSHSHHSHSS"), alphabet = 2) ## K.2 D.2 ## HEHHEE 14.96921 3.117581e-05 ## GHHGGHGHH 25.60208 1.963387e-08 ## HSHSHHSHSS 26.90906 7.935321e-09 acss(c("HEHHEE", "GHHGGHGHUE", "HSHSHHSHSS"), NULL) ## K.2 K.4 K.5 K.6 K.9 ## HEHHEE 14.96921 18.55227 19.70361 20.75762 23.38852 ## GHHGGHGHUE NA 31.75832 33.00795 34.27457 37.78935 ## HSHSHHSHSS 26.90906 29.37852 30.52566 31.76229 35.15241 ## D.2 D.4 D.5 D.6 ## HEHHEE 3.117581e-05 2.601421e-06 1.171176e-06 5.640722e-07 ## GHHGGHGHUE NA 2.752909e-10 1.157755e-10 4.812021e-11 ## HSHSHHSHSS 7.935321e-09 1.432793e-09 6.469341e-10 2.745360e-10 ## D.9 ## HEHHEE 9.106564e-08 ## GHHGGHGHUE 4.209915e-12 ## HSHSHHSHSS 2.618613e-11 ## Not run: likelihood_d(c("HTHTHTHT", "HTHHTHTT"), alphabet = 2) ## HTHTHTHT HTHHTHTT ## 0.010366951 0.003102718 likelihood_ratio(c("HTHTHTHT", "HTHHTHTT"), alphabet = 2) ## HTHTHTHT HTHHTHTT ## 0.3767983 1.2589769 prob_random(c("HTHTHTHT", "HTHHTHTT"), alphabet = 2) ## HTHTHTHT HTHHTHTT ## 0.2736772 0.5573217 ## End(Not run) local_complexity(c("01011010111" ,"GHHGGHGHUE"), alphabet = 5, span=5) ## $`01011010111` ## 01011 10110 01101 11010 10101 01011 10111 ## 16.22469 16.24766 16.24766 16.22469 16.24322 16.22469 15.93927 ## ## $GHHGGHGHUE ## GHHGG HHGGH HGGHG GGHGH GHGHU HGHUE ## 16.44639 16.44639 16.24766 16.22469 16.58986 16.86449 local_complexity(c("01011010111" ,"GHHGGHGHUE"), span=7) ## $`01011010111` ## 0101101 1011010 0110101 1101011 1010111 ## 26.52068 26.52068 26.47782 26.62371 26.29186 ## ## $GHHGGHGHUE ## GHHGGHG HHGGHGH HGGHGHU GGHGHUE ## 27.04623 26.86992 27.30871 27.84322
# WARNING: The first call to one of the functions # discussed on this page loads a large data set # and usually takes > 10 seconds. Stay patient. acss(c("HEHHEE", "GHHGGHGHH", "HSHSHHSHSS")) ## K.9 D.9 ## HEHHEE 23.38852 9.106564e-08 ## GHHGGHGHH 33.50168 8.222205e-11 ## HSHSHHSHSS 35.15241 2.618613e-11 acss(c("HEHHEE", "GHHGGHGHH", "HSHSHHSHSS"))[,"K.9"] ## [1] 23.38852 33.50168 35.15241 acss(c("HEHHEE", "GHHGGHGHH", "HSHSHHSHSS"), alphabet = 2) ## K.2 D.2 ## HEHHEE 14.96921 3.117581e-05 ## GHHGGHGHH 25.60208 1.963387e-08 ## HSHSHHSHSS 26.90906 7.935321e-09 acss(c("HEHHEE", "GHHGGHGHUE", "HSHSHHSHSS"), NULL) ## K.2 K.4 K.5 K.6 K.9 ## HEHHEE 14.96921 18.55227 19.70361 20.75762 23.38852 ## GHHGGHGHUE NA 31.75832 33.00795 34.27457 37.78935 ## HSHSHHSHSS 26.90906 29.37852 30.52566 31.76229 35.15241 ## D.2 D.4 D.5 D.6 ## HEHHEE 3.117581e-05 2.601421e-06 1.171176e-06 5.640722e-07 ## GHHGGHGHUE NA 2.752909e-10 1.157755e-10 4.812021e-11 ## HSHSHHSHSS 7.935321e-09 1.432793e-09 6.469341e-10 2.745360e-10 ## D.9 ## HEHHEE 9.106564e-08 ## GHHGGHGHUE 4.209915e-12 ## HSHSHHSHSS 2.618613e-11 ## Not run: likelihood_d(c("HTHTHTHT", "HTHHTHTT"), alphabet = 2) ## HTHTHTHT HTHHTHTT ## 0.010366951 0.003102718 likelihood_ratio(c("HTHTHTHT", "HTHHTHTT"), alphabet = 2) ## HTHTHTHT HTHHTHTT ## 0.3767983 1.2589769 prob_random(c("HTHTHTHT", "HTHHTHTT"), alphabet = 2) ## HTHTHTHT HTHHTHTT ## 0.2736772 0.5573217 ## End(Not run) local_complexity(c("01011010111" ,"GHHGGHGHUE"), alphabet = 5, span=5) ## $`01011010111` ## 01011 10110 01101 11010 10101 01011 10111 ## 16.22469 16.24766 16.24766 16.22469 16.24322 16.22469 15.93927 ## ## $GHHGGHGHUE ## GHHGG HHGGH HGGHG GGHGH GHGHU HGHUE ## 16.44639 16.44639 16.24766 16.22469 16.58986 16.86449 local_complexity(c("01011010111" ,"GHHGGHGHUE"), span=7) ## $`01011010111` ## 0101101 1011010 0110101 1101011 1010111 ## 26.52068 26.52068 26.47782 26.62371 26.29186 ## ## $GHHGGHGHUE ## GHHGGHG HHGGHGH HGGHGHU GGHGHUE ## 27.04623 26.86992 27.30871 27.84322
Functions to compute different measures of complexity for strings: Entropy, Second-Order Entropy, and Change Complexity
entropy(string) entropy2(string) change_complexity(string)
entropy(string) entropy2(string) change_complexity(string)
string |
|
For users who need advanced functions, a comprehensive package computing various versions of entropy estimators is available entropy. For users who just need first and second-order entropy and which to apply them to short string, the acss package provides two functions: entropy
(first-order entropy) and entropy2
second-order entropy.
Change complexity (change_complexity
) assesses cognitive complexity or the subjective perception of complexity of a binary string. It has been comprehensively defined by Aksentijevic and Gibson (2012). Although the algorithm will work with any number of symbols up to 10, the rationale of Change Complexity only applies to binary strings.
numeric
, the complexity of the string. For entropy
and entropy2
of the same length as string
. change_complexity
currently only works with inputs of length 1.
Aksentijevic & Gibson (2012). Complexity equals change. Cognitive Systems Research, 15-17, 1-16.
strings1 <- c("010011010001", "0010203928837", "0000000000") strings2 <- c("001011", "01213", "010101010101") entropy(strings1) entropy("XYXXYYXYXXXY") # "same" string as strings1[1] entropy(c("HUHHEGGTE", "EGGHHU")) entropy2(strings1) entropy2("XYXXYYXYXXXY") entropy2(strings2) change_complexity(strings1) change_complexity("XYXXYYXYXXXY")
strings1 <- c("010011010001", "0010203928837", "0000000000") strings2 <- c("001011", "01213", "010101010101") entropy(strings1) entropy("XYXXYYXYXXXY") # "same" string as strings1[1] entropy(c("HUHHEGGTE", "EGGHHU")) entropy2(strings1) entropy2("XYXXYYXYXXXY") entropy2(strings2) change_complexity(strings1) change_complexity("XYXXYYXYXXXY")
34 participants were asked to produce at their own pace a series of 10 symbols among "A", "B", "C", and "D" that would "look as random as possible, so that if someone else sees the sequence, she will believe it is a truly random one".
exp1
exp1
A data.frame with 34 rows and 2 variables.
Gauvrit, Singmann, Soler-Toscano & Zenil (submitted). Complexity for psychology. A user-friendly implementation of the coding theorem method.
# load data data(exp1) # summary statistics nrow(exp1) summary(exp1$age) mean(exp1$age) sd(exp1$age) ## Not run: # this uses code from likelihood_d() to calculate the mean complexity K # for all strings of length 10 with alphabet = 4: tmp <- acss_data[nchar(rownames(acss_data)) == 10, "K.4", drop = FALSE] tmp <- tmp[!is.na(tmp[,"K.4"]),,drop = FALSE] tmp$count <- count_class(rownames(tmp), alphabet = 4) (mean_K <- with(tmp, sum(K.4*count)/sum(count))) t.test(acss(exp1$string, 4)[,"K.4"], mu = mean_K) ## End(Not run)
# load data data(exp1) # summary statistics nrow(exp1) summary(exp1$age) mean(exp1$age) sd(exp1$age) ## Not run: # this uses code from likelihood_d() to calculate the mean complexity K # for all strings of length 10 with alphabet = 4: tmp <- acss_data[nchar(rownames(acss_data)) == 10, "K.4", drop = FALSE] tmp <- tmp[!is.na(tmp[,"K.4"]),,drop = FALSE] tmp$count <- count_class(rownames(tmp), alphabet = 4) (mean_K <- with(tmp, sum(K.4*count)/sum(count))) t.test(acss(exp1$string, 4)[,"K.4"], mu = mean_K) ## End(Not run)
Responses of one participant (42 years old) to 200 randomly generated strings of length 10 from an alphabet of 6 symbols. For each string, the participant was asked to indicate whther or not the string appears random or not.
exp2
exp2
A data.frame with 200 rows and 2 variables.
Gauvrit, Singmann, Soler-Toscano & Zenil (submitted). Complexity for psychology. A user-friendly implementation of the coding theorem method.
# load data data(exp2) exp2$K <- acss(exp2$string, 6)[,"K.6"] m_log <- glm(response ~ K, exp2, family = binomial) summary(m_log) # odds ratio of K: exp(coef(m_log)[2]) # calculate threshold of 0.5 (threshold <- -coef(m_log)[1]/coef(m_log)[2]) require(effects) require(lattice) plot(Effect("K", m_log), rescale.axis = FALSE, ylim = c(0, 1)) trellis.focus("panel", 1, 1) panel.lines(rep(threshold, 2), c(0, 0.5), col = "black", lwd = 2.5, lty = 3) panel.lines(c(33,threshold), c(0.5, 0.5), col = "black", lwd = 2.5, lty = 3) trellis.unfocus()
# load data data(exp2) exp2$K <- acss(exp2$string, 6)[,"K.6"] m_log <- glm(response ~ K, exp2, family = binomial) summary(m_log) # odds ratio of K: exp(coef(m_log)[2]) # calculate threshold of 0.5 (threshold <- -coef(m_log)[1]/coef(m_log)[2]) require(effects) require(lattice) plot(Effect("K", m_log), rescale.axis = FALSE, ylim = c(0, 1)) trellis.focus("panel", 1, 1) panel.lines(rep(threshold, 2), c(0, 0.5), col = "black", lwd = 2.5, lty = 3) panel.lines(c(33,threshold), c(0.5, 0.5), col = "black", lwd = 2.5, lty = 3) trellis.unfocus()
Mean responses on a 6-point scale ("definitely random" to "definitely not random") of participants to 216 strings of length 21.
matthews2013
matthews2013
A data.frame with 216 rows and 3 variables.
Matthews, W. (2013). Relatively random: Context effects on perceived randomness and predicted outcomes. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(5), 1642-1648.
## Not run: data(matthews2013) spans <- 3:11 # note, the next loop takes more than 5 minutes. for (i in spans) { matthews2013[,paste0("K2_span", i)] <- sapply(local_complexity(matthews2013$string, alphabet=2, span = i), mean) } lm_list <- vector("list", 8) for (i in seq_along(spans)) { lm_list[[i]] <- lm(as.formula(paste0("mean ~ K2_span", spans[i])), matthews2013) } plot(spans, sapply(lm_list, function(x) summary(x)$r.squared), type = "o") # do more predictors increase fit? require(MASS) m_initial <- lm(mean ~ 1, matthews2013) m_step <- stepAIC(m_initial, scope = as.formula(paste("~", paste(paste0("K2_span", spans), collapse = "+")))) summary(m_step) m_initial2 <- lm(as.formula(paste("mean ~", paste(paste0("K2_span", spans), collapse = "+"))), matthews2013) m_step2 <- stepAIC(m_initial2) summary(m_step2) ## End(Not run)
## Not run: data(matthews2013) spans <- 3:11 # note, the next loop takes more than 5 minutes. for (i in spans) { matthews2013[,paste0("K2_span", i)] <- sapply(local_complexity(matthews2013$string, alphabet=2, span = i), mean) } lm_list <- vector("list", 8) for (i in seq_along(spans)) { lm_list[[i]] <- lm(as.formula(paste0("mean ~ K2_span", spans[i])), matthews2013) } plot(spans, sapply(lm_list, function(x) summary(x)$r.squared), type = "o") # do more predictors increase fit? require(MASS) m_initial <- lm(mean ~ 1, matthews2013) m_step <- stepAIC(m_initial, scope = as.formula(paste("~", paste(paste0("K2_span", spans), collapse = "+")))) summary(m_step) m_initial2 <- lm(as.formula(paste("mean ~", paste(paste0("K2_span", spans), collapse = "+"))), matthews2013) m_step2 <- stepAIC(m_initial2) summary(m_step2) ## End(Not run)
normalize_string
takes a character vector and normalizes its input using the symbols 0, 1, 2...9. count_class
takes a character vector and an integer alphabet
(with the restriction that the number of different symbols in the character vector doesn't exceed alphabet
) and returns the total number of strings that are equivalent to the input when normalized and considering alphabet
. alternations
returns the number of alternations of symbols in a string.
normalize_string(string) count_class(string, alphabet) alternations(string, proportion = FALSE)
normalize_string(string) count_class(string, alphabet) alternations(string, proportion = FALSE)
string |
|
alphabet |
|
proportion |
|
nothing yet.
normalize_string
A normalized vector of strings of the same length as string
.
count_class
A vector of the same length as string
with the number of possible equivalent strings when string
is normalized and considering alphabet
.
alternations
A vector with the number (or proprtion) of alternations of the same length as string
#normalize_string: normalize_string(c("HUHHEGGTE", "EGGHHU")) normalize_string("293948837163536") # count_class count_class("010011",2) count_class("332120",4) count_class(c("HUHHEGGTE", "EGGHHU"), 5) count_class(c("HUHHEGGTE", "EGGHHU"), 6) # alternations: alternations("0010233") alternations("0010233", proportion = TRUE) alternations(c("HUHHEGGTE", "EGGHHU")) alternations(c("HUHHEGGTE", "EGGHHU"), proportion = TRUE)
#normalize_string: normalize_string(c("HUHHEGGTE", "EGGHHU")) normalize_string("293948837163536") # count_class count_class("010011",2) count_class("332120",4) count_class(c("HUHHEGGTE", "EGGHHU"), 5) count_class(c("HUHHEGGTE", "EGGHHU"), 6) # alternations: alternations("0010233") alternations("0010233", proportion = TRUE) alternations(c("HUHHEGGTE", "EGGHHU")) alternations(c("HUHHEGGTE", "EGGHHU"), proportion = TRUE)