The Benchmark Data Library Project introduces a new way to approach the design and generation of artificial data. It is not primarily aimed to be a dedicated data generator - it rather provides a framework to enable the researcher to efficiently code easily understandable and reproducible artificial data. The entire concept is based on the notion of merely working with a standardized format for the metadata. All information that is needed for actual data generation is found in a single .R file, the format of which is strictly prescribed.
That depends on whether you have a suitable .R file. You can either obtain one from the metadata repository that serves as a common place to share benchmarking setups (BDLP Repository). If you do not have such a file, you need to write one yourself - the package does not provide an implementation that allows immediate data generation by just calling a function and passing a few arguments. A .R file (which shall be called setup file in the following) is absolutely necessary.
If you have a setup file - such a file containing a VERY simple benchmarking setup that consists of two simple metric 2d datasets is included in the package and can be sourced with
The file only contains one function which has to be of the same name as the file name - always authorYEAR. If several benchmarking setups are available for the same author in one year, additional indication of this fact by authorYEARa/b/c/etc. are permissible. This function can return two things; either a summary about the setup:
## $summary
## n k shape
## 1 50 2 spherical
## 2 40 2 spherical
##
## $reference
## [1] "Dangl R. (2014) A small simulation study. Journal of Simple Datasets 10(2), 1-10"
or a metadata object of one of the two available datasets. This is done by merely providing the the set nr and results in the following object:
## An object of class "metadata.metric"
## Slot "standardization":
## [1] "NONE"
##
## Slot "clusters":
## $c1
## $c1$n
## [1] 25
##
## $c1$mu
## [1] 4 5
##
## $c1$Sigma
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
##
##
## $c2
## $c2$n
## [1] 25
##
## $c2$mu
## [1] -1 -2
##
## $c2$Sigma
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
##
##
##
## Slot "genfunc":
## function (n = 1, mu, Sigma, tol = 1e-06, empirical = FALSE, EISPACK = FALSE)
## {
## p <- length(mu)
## if (!all(dim(Sigma) == c(p, p)))
## stop("incompatible arguments")
## if (EISPACK)
## stop("'EISPACK' is no longer supported by R", domain = NA)
## eS <- eigen(Sigma, symmetric = TRUE)
## ev <- eS$values
## if (!all(ev >= -tol * abs(ev[1L])))
## stop("'Sigma' is not positive definite")
## X <- matrix(rnorm(p * n), n)
## if (empirical) {
## X <- scale(X, TRUE, FALSE)
## X <- X %*% svd(X, nu = 0)$v
## X <- scale(X, FALSE, TRUE)
## }
## X <- drop(mu) + eS$vectors %*% diag(sqrt(pmax(ev, 0)), p) %*%
## t(X)
## nm <- names(mu)
## if (is.null(nm) && !is.null(dn <- dimnames(Sigma)))
## nm <- dn[[1L]]
## dimnames(X) <- list(nm, NULL)
## if (n == 1)
## drop(X)
## else t(X)
## }
## <bytecode: 0x560c0a22d390>
## <environment: namespace:MASS>
##
## Slot "seedinfo":
## [[1]]
## [1] 100
##
## [[2]]
## [1] "4.4.2"
##
## [[3]]
## [1] "Mersenne-Twister" "Inversion" "Rejection"
The object consists of several slots. Most importantly, the slot
genfunc
contains the random number generating function (in
this case the default mvrnorm
). The second important slot
is clusters
, which contains the parameters that are needed
by genfunc
to generate data. It is important to notice that
the parameters in clusters
are identical to the arguments
needed by genfunc
. Two other slots are quite crucial for
processing the metadata object: metaseedinfo
and
seedinfo
. The former sets the random number generator
parameters for the purpose of calculating the metadata object. For
example, if the cluster centers were to be chosen randomly and not in a
fixed way like in the case at hand, the metaseedinfo
parameters ensure reproducibility of a particular metadata object. The
latter, seedinfo
, is information that is sent along with
the metadata object to the the function that generates the actual random
numbers of the dataset. It can be practical to have these two sets of
random number generator parameters - if not needed, they default to the
same set of arguments. Therefore, if a metadata object has been created,
one can generate the actual numbers. This can be done for one single
data set from this particular metadata scenario by
## V1 V2
## 1 3.022824 6.146590
## 2 4.127758 6.030671
## 3 2.393572 5.210740
## 4 3.453221 4.794338
## 5 4.225362 6.570782
## 6 4.239723 4.162089
If a different random number seed is needed this can be done by
meta <- dangl2014(1, seedinfo = list(120, "4.0.3", c("Mersenne-Twister", "Inversion")))
data <- generateData(meta)
head(data)
## V1 V2
## 1 3.647198 4.677243
## 2 4.292669 4.307122
## 3 4.419130 4.704274
## 4 4.344384 3.938583
## 5 3.806678 6.125037
## 6 3.869850 4.494429
This data can then be plotted with the commonly used plot functions. Yet still, it is also possible to catch a glimpse of the structure of the data by plotting a metadata object directly. For this purpose, an instance of the dataset is generated automatically, which saves a few steps in-between.
However, generally one may require for benchmarking purposes a large number of datasets drawn from one particular metadata scenario. This works as follows:
Which creates an SQLite database in the current working directory
that contains 50 datasets drawn from matadata object 1 of the setup
name. Of course, metaseedinfo
and seedinfo
is
also available here. The random number seed starts at a certain base
value (default 100) and increments by 1 for each draw. Therefore draw
number 1 uses seed 101, etc. If a different increment step is desired,
one can set the argument increment
to this value.
In order to explore the concept in more depth we look at the function body that is in the included setup file:
dangl2014 <- function(setnr = NULL,
seedinfo = list(100,
paste(R.version$major, R.version$minor, sep = "."),
RNGkind()),
info = FALSE,
metaseedinfo = list(100,
paste(R.version$major, R.version$minor, sep = "."),
RNGkind())){
inf <- data.frame(n = c(50, 40), k = c(2,2), shape = c("spherical", "spherical"))
ref <- "Dangl R. (2014) A small simulation study. Journal of Simple Datasets 10(2), 1-10"
if(info == T) return(list(summary = inf, reference = ref))
if(is.null(metaseedinfo)) metaseedinfo <- seedinfo
set.seed(metaseedinfo[[1]])
RNGversion(metaseedinfo[[2]])
RNGkind(metaseedinfo[[3]][1], metaseedinfo[[3]][2])
if(setnr == 1) {
return(new("metadata.metric",
clusters = list(c1 = list(n = 25, mu = c(4,5), Sigma=diag(1,2)),
c2 = list(n = 25, mu = c(-1,-2), Sigma=diag(1,2))),
genfunc = MASS::mvrnorm, seedinfo = seedinfo))
}
if(setnr == 2){
return(new("metadata.metric",
clusters = list(c1 = list(n = 20, mu = c(0,2), Sigma=diag(1,2)),
c2 = list(n = 20, mu = c(-1,-2), Sigma=diag(1,2))),
genfunc = MASS::mvrnorm, seedinfo = seedinfo))
}
}
Again, we can see that the first part of the function provides the information output that has already been described above. The second half of the function specifies the metadata objects. The function arguments are fixed and must not be changed. In the middle, the metaseedinfo parameters are applied in case random effects are used in metadata generation, and the seedinfo parameters are passed on to the metadata object output. This is certainly a simple example; much more complex scenarios can be realised. The package functions merely assemble the dataset based on the parameters supplied in the setup file cluster by cluster. Therefore, any random number generating function of any R package can be used. Still, also custom functions written from scratch can be used and included in the setup file. The only strict limitation that is imposed on the setup file is that the main function that is then used for data generation must produce the metadata files according to the structure defined in the package.
Essentially, by writing a new setup file that complies with a few important rules:
info = T
, or a metadata object (for a specific
setnr
and metaseedinfo
)authorYear_myfunc
in order to avoid conflicts with other
setup files which could coincidentally contain identically named
functionsAt the moment, 5 types of metadata objects are supported: metric, ordinal, binary, functional and random string data. It is of course possible to have several types of metadata in a benchmarking setup. At the moment it is not yet possible to have a metadata object that mixes several types of data in one dataset (e.g. some metric, ordinal and binary variables), but support for this is planned for the next release.
The setup file does not need to be created completely from scratch;
two function can do that to different degrees. Function
createFileskeleton
creates a blank template that fills in
the most important information (function names, arguments, return values
etc.). A much more convenient solution is saveSetup - if several
metadata objects have been created (this in turn can be done with the
help of initializeObject
), a setup file can be written to
the disk without any further action necessary. It can be used
immediately for data generation. A simple example looks as follows:
require(MASS)
m1 <- initializeObject(type = "metric", genfunc = mvrnorm, k = 2)
m1@clusters$cl1 <- list(n = 25, mu = c(4,5), Sigma = diag(1,2))
m1@clusters$cl2 <- list(n = 25, mu = c(-1,-2), Sigma = diag(1,2))
m2 <- initializeObject(type = "metric", genfunc = mvrnorm, k = 2)
m2@clusters$cl1 <- list(n = 44, mu = c(1,2), Sigma = diag(1,2))
m2@clusters$cl2 <- list(n = 66, mu = c(-5,-6), Sigma = diag(1,2))
saveSetup(name="miller2012.R", author="John Miller", mail="[email protected]",
inst="Example University", cit="Simple Data, pp. 23-24", objects=list(m1, m2),
table=data.frame(n = c(50, 110), k = c(2,2), shape = c("spherical", "spherical")))
generateDatabase(name = "miller2012.R", setnr = 1, draws = 20)
The procedure is basically identical for ordinal, binary and random string data. Functional data requires a structurally quite different metadata object. It is advised to have a look at the class description for functional metadata in the package manual. An example for functional data works as follows:
Fun1 <- function(x){x^2}
Fun2 <- function(x){sqrt(x)}
Fun3 <- function(x){sin(2*pi*x)}
functions <- list(Fun1 = Fun1, Fun2 = Fun2, Fun3 = Fun3)
interval <- c(0,1)
gridPoints <- 30
sd <- 0.2
n <- 100
minTimePoints <- 5
maxTimePoints <- 10
regular <- FALSE
grid <- sampleGrid(n, minTimePoints, maxTimePoints, gridPoints, regular)
meta <- new("metadata.functional", functions = functions,
gridMatrix = grid,
sd=sd,
sd_distribution="rnorm",
interval = interval,
resolution=gridPoints,
total_n = n,
minTimePoints = minTimePoints,
maxTimePoints = maxTimePoints,
regular=F)
data <- generateData(meta)
head(data)
## curves xvalvector yvalvector
## 1 1 0.0000000 0.22931806
## 2 1 0.1379310 0.22515913
## 3 1 0.2413793 0.10041188
## 4 1 0.2758621 0.03496742
## 5 1 0.3448276 0.43306242
## 6 1 0.4827586 0.06547377
New setup files that are used in actual benchmarking studies can be added to the repository of the Benchmark Data Library Project (link above), which is highly appreciated! This might greatly help other researchers who like to use artificial data in their studies and who do not want to reinvent the wheel. Furthermore, benchmarking studies are much more meaningful if methods can be compared on the exact same data from another study. This is actually one of the major effects that are intended to come along with using this package.