Title: | Benford's Analysis on Large Data Sets |
---|---|
Description: | Perform the Benford's Analysis to a data set in order to evaluate if it contains human fabricated data. For more details on the method see Moreau, 2021, Model Assist. Statist. Appl., 16 (2021) 73–79. <doi:10.3233/MAS-210517>. |
Authors: | Vitor Hugo Moreau |
Maintainer: | Vitor Hugo Moreau <[email protected]> |
License: | Creative Commons Attribution 4.0 International License |
Version: | 1.0.1 |
Built: | 2024-12-01 08:45:48 UTC |
Source: | CRAN |
Benford's analysis makes use of a statistic property of natural data sets called Benford's Law. Benford’s Law (also called “first digit phenomenon”) is a statistical phenomenon that describes the frequency of a given integer, from 1 to 9, to be in the first significant digit in the numbers of a large data set. The Benford’s law has been most practically used to detect fraud or rounding errors in real world numbers. This is possible by examining departures in the frequencies of individual digits from those predicted by Benford. This only makes sense once it is established (often empirically) that the data follow the law under normal circumstances (Sambridge et al., 2011). This is true because human pseudo-random productions are in many ways different from true randomness (Nickerson, 2002). As a consequence, fabricated data might fit to the Benford’s Law to a lesser extent than genuine data (Banks and Hill, 1974; Gauvrit et al., 2017). Benford package is able to analyze the frequence of the first, second, first-two and first-three digits in large data sets.
benford(x, plot = FALSE, mode = 1)
benford(x, plot = FALSE, mode = 1)
x |
A numeric vector with the data set numbers to be analyzed |
plot |
A logic that control whether the resulting first digit distribution and the Benford's distribution would be ploted |
mode |
A numeric value (1, 2, 12 or 123) to select, respectively, first digit, second digit, first-two digits or first-three digits analysis |
LIST countaining: 1. Named vector with three elements: the Chi Square test p value (p), the root mean square deviation (RMSD) from the Benford's distribution, and the log of the likelihhod of the first digit distribution in relation to the Benford's distribution; 2. Matrix, with three columns, countaining the first digits ([,1]), the frequency counts of the first digit in the data set ([,2]) and the frequency count of the first digit in a classic Benford's distribution ([,3])
RMSD and likelihood are not formal statistic tests, so it may be evaluated only in a comparative way. To perform analysis in order to get to absolute conclusion on the veracity of the data set, Chi square p value is more trustable. For first-two and first-three digits analysis, the number of observation in the data set must be large enough to permit good Chi-square calculation. Otherwise, benford will return a warning message.
Vitor Hugo Moreau, Ph.D. Department of Biotechnology Federal University of Bahia, Brazil
Banks WP, Hill DK. 1974. The apparent magnitude of number scaled by random production. J. Exp. Psychol. 102:353–376. <http://content.apa.org/journals/xge/102/2/353>. Benford F. 1938. The Law of Anomalous Numbers. Proc. Am. Philos. Soc. 78:551–572. <http://www.jstor.org/stable/984802>. Gauvrit N, Houillon J-C, Delahaye J-P. 2017. Generalized Benford’s Law as a Lie Detector. Adv. Cogn. Psychol. 13:121–127. <http://ac-psych.org/en/download-pdf/id/214>. Moreau, V. H. 2021. Inconsistencies in countries COVID-19 data revealed by Benford’s law. Model Assisted Statistics and Applications 16 (2021) 73–79. <http://dx.doi.org/10.3233/MAS-210517> Nickerson RS. 2002. The production and perception of randomness. Psychol. Rev. 109:330–357. <http://doi.apa.org/getdoi.cfm?doi=10.1037/0033-295X.109.2.330>. Sambridge M, Tkalcic H, Arroucau P. 2011. Benford’s Law of First Digits: From Mathematical Curiosity to Change Detector. Asia Pacific Math. Newsl. 1:1–5.
#Computer generated random data do not conform to the benford law result <- benford(seq(1,10000)+rnorm(10000,0,100), TRUE) #Natural data set, countaining the number of daily new cases of COVID-19 in Switzerland ##conform to the Benford' Law result <- benford(switz.data, TRUE) ##conform to second digit analysis of the Benford' Law result <- benford(switz.data, TRUE, 2)
#Computer generated random data do not conform to the benford law result <- benford(seq(1,10000)+rnorm(10000,0,100), TRUE) #Natural data set, countaining the number of daily new cases of COVID-19 in Switzerland ##conform to the Benford' Law result <- benford(switz.data, TRUE) ##conform to second digit analysis of the Benford' Law result <- benford(switz.data, TRUE, 2)
A data set with the number of new cases of COVID-19 in Switzerland. A natural data set that conform very well to the Benford's law
data("switz.data")
data("switz.data")
A numeric vector with 383 observations.
x
a numeric vector
Our World in Data COVID-19 project: https://ourworldindata.org/coronavirus-data
Roser M, Ritchie H, Ortiz-Ospina E, Hasel J. 2020. Coronavirus Pandemic (COVID-19). https://ourworldindata.org/coronavirus.
data(switz.data) ## maybe str(switz.data) ; plot(switz.data) ...
data(switz.data) ## maybe str(switz.data) ; plot(switz.data) ...