Title: | Reading Machine Learning Benchmark Data Sets in Different Formats |
---|---|
Description: | Functions for reading data sets in different formats for testing machine learning tools are provided. This makes it possible to run a loop over several data sets in their original form, for example as downloaded from the UCI Machine Learning Repository. The data are not part of the package and have to be downloaded separately. |
Authors: | Petr Savicky |
Maintainer: | Petr Savicky <[email protected]> |
License: | GPL-3 |
Version: | 0.9-7 |
Built: | 2024-12-10 06:54:39 UTC |
Source: | CRAN |
The package contains functions for maintaining and using a structure
that describes a collection of machine learning data sets, and for
reading them into the R environment through a unified interface; see
the functions prepareDSList() and dsRead().
The data are not part of the package. The package requires a path to
a local copy of the data and a path to their description.
The description of the data sets is a directory containing an XML file
contents.xml and a subdirectory "scripts", which holds one R script per
data set for reading that data set into R. The file contents.xml
contains information on all the data sets. In particular, it contains
their names for local identification, their public names, and the names
of the files representing each data set. The name of the script for
reading a data set is derived from its identification name. The complete
list of the fields in contents.xml may be obtained using getFields().
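The layout described above can be checked before calling the package. The following is a minimal base R sketch, not part of readMLData: the helper name hasDescriptionLayout and the throwaway directory are assumptions made for illustration, and the placeholder file does not follow the real contents.xml schema.

```r
# Hypothetical helper (not a readMLData function): checks that a directory
# contains contents.xml and a "scripts" subdirectory.
hasDescriptionLayout <- function(pathDescription) {
  file.exists(file.path(pathDescription, "contents.xml")) &&
    dir.exists(file.path(pathDescription, "scripts"))
}

# Build a throwaway directory with the expected layout.
descDir <- file.path(tempdir(), "demoDescription")
dir.create(file.path(descDir, "scripts"), recursive = TRUE, showWarnings = FALSE)
# Placeholder content only; this is not the real contents.xml schema.
writeLines("<placeholder/>", file.path(descDir, "contents.xml"))

print(hasDescriptionLayout(descDir))
print(hasDescriptionLayout(file.path(tempdir(), "noSuchDirectory")))
```

Such a check only confirms that the two expected entries exist; it says nothing about whether their contents are valid.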
For the simplest use of the package, reading the data sets, the
functions prepareDSList() and dsRead() are sufficient.
The remaining functions are useful for adding further data sets to
the description. Use help(package=readMLData) or
library(help=readMLData) to see the list of functions.
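A loop over the bundled example data sets might look like the sketch below. It uses only functions documented on this page, but it assumes that getAvailable() with default arguments returns the identification names of the available data sets, and it is guarded so that it does nothing when the package is not installed.

```r
# Collect the dimensions of every locally available example data set.
dims <- list()
if (requireNamespace("readMLData", quietly = TRUE)) {
  library(readMLData)
  pathData <- getPath("exampleData")
  pathDescription <- getPath("exampleDescription")
  dsList <- prepareDSList(pathData, pathDescription)
  # Assumption: with default arguments, getAvailable() returns the
  # identification names of the data sets whose files are present.
  for (id in getAvailable(dsList)) {
    dat <- dsRead(dsList, id)
    # dsRead() outputs NULL on error, so skip those data sets.
    if (!is.null(dat)) dims[[id]] <- dim(dat)
  }
}
print(dims)
```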
The fields that should be included in "contents.xml" are those with
either usage=="obligatory" or usage=="optional" in the table produced
by getFields(). Fields with usage=="additional" and usage=="computed"
are included automatically by the function prepareDSList().
An example of a description directory describing three UCI data sets
is in the exampleDescription subdirectory of the installed package.
The data themselves are in the exampleData subdirectory. See
http://www.cs.cas.cz/~savicky/readMLData/ for description
files of further data sets from the UCI Machine Learning Repository.
Petr Savicky
UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/.
Additional resources for the CRAN package readMLData, http://www.cs.cas.cz/~savicky/readMLData/.
For each column, its class and the number of different values are determined. For numeric columns, the minimum and maximum are also computed.
analyzeData(dat)
dat |
A data frame. |
A data frame with columns "class", "num.unique", "min", "max", which
correspond to properties of the columns of dat. The rows of the output
data frame correspond to the columns of dat.
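The shape of this output can be illustrated in base R. The sketch below is not the package's implementation: summarizeColumns is a hypothetical helper, and its handling of missing values and of non-numeric columns (NA for "min" and "max") is an assumption.

```r
# Hypothetical base R sketch of a per-column summary with the columns
# "class", "num.unique", "min", "max"; one output row per input column.
summarizeColumns <- function(dat) {
  isNum <- vapply(dat, is.numeric, logical(1))
  rangeOf <- function(f) {
    vapply(seq_along(dat), function(i) {
      # Assumption: non-numeric columns get NA for min and max.
      if (isNum[i]) as.numeric(f(dat[[i]], na.rm = TRUE)) else NA_real_
    }, numeric(1))
  }
  data.frame(
    class = vapply(dat, function(x) class(x)[1], character(1)),
    num.unique = vapply(dat, function(x) length(unique(x)), integer(1)),
    min = rangeOf(min),
    max = rangeOf(max),
    row.names = names(dat),
    stringsAsFactors = FALSE
  )
}

colDemo <- data.frame(x = c(1.5, 2, 2), g = c("a", "b", "a"),
                      stringsAsFactors = FALSE)
colSummary <- summarizeColumns(colDemo)
print(colSummary)
```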
Petr Savicky
pathData <- getPath("exampleData")
pathDescription <- getPath("exampleDescription")
dsList <- prepareDSList(pathData, pathDescription)
dat <- dsRead(dsList, "glass")
analyzeData(dat)
Checks consistency of the parameters specified for each data set in the dsList
data frame created by prepareDSList().
checkConsistency(dsList, outputInd=FALSE)
dsList |
Data frame as created by |
outputInd |
Logical. Determines whether the output should be a vector of indices of the data sets with conflicts. |
Depending on outputInd, either a vector of indices of the data sets with
a conflict between the specified parameters, or NULL invisibly.
Petr Savicky
pathData <- getPath("exampleData")
pathDescription <- getPath("exampleDescription")
dsList <- prepareDSList(pathData, pathDescription)
checkConsistency(dsList)
Compares the types specified in dsList with the types found in a data set
itself.
checkType(dsList, id, dat=NULL)
dsList |
Data frame describing the data sets as produced by |
id |
Numeric or character of length one. Index or the identification of a data set. |
dat |
An optional data frame as read by |
The name of the tested data set and the result of the test are printed.
If errors are found, a more detailed message is printed. The output value
is TRUE or FALSE, returned invisibly, according to whether the types are
correct or not.
Petr Savicky
pathData <- getPath("exampleData")
pathDescription <- getPath("exampleDescription")
dsList <- prepareDSList(pathData, pathDescription)
checkType(dsList, 1)
The function runs an external download tool with arguments read from a file in a data folder.
dsDownload(dsList, id, command, fileName)
dsList |
Data frame as created by |
id |
Name of the data set in |
command |
Character. A command-line web downloading tool, for example
|
fileName |
Character. A name of the file in the data directory, which contains the URL of the data on the web. |
If no data set or more than one data set corresponding to id is found,
a corresponding error message is printed.
The function returns no value. The protocol generated by the specified tool is printed.
Petr Savicky
## Not run:
pathData <- getPath("exampleData")
pathDescription <- getPath("exampleDescription")
dsList <- prepareDSList(pathData, pathDescription)
dat <- dsDownload(dsList, "glass", "wget", "links.txt")
## End(Not run)
The function reads data sets included in the description
in the data frame dsList into the R environment using a unified interface.
dsRead(dsList, id, responseName = NULL, originalNames=TRUE, deleteUnused=TRUE, keepContents=FALSE)
dsList |
Data frame as created by |
id |
Name of the data set in |
responseName |
Character. The required name of the response column in the output data frame created from the data set. |
originalNames |
If TRUE, the original names of columns are used, if they are present in the description XML file. |
deleteUnused |
Logical. Controls whether columns containing case labels, or other columns not suitable as attributes, are removed from the data. |
keepContents |
Logical. If |
The function uses dsList$avaiable to determine whether the files for
the required data set are present in the local directory dsList$pathData.
If not, a corresponding error message is printed. See prepareDSList()
and getAvailable().
A data frame containing the required data set, possibly transformed according
to the setting of the parameters responseName, originalNames, deleteUnused.
If an error occurred, the function outputs NULL.
Petr Savicky
readMLData, prepareDSList, getAvailable.
pathData <- getPath("exampleData")
pathDescription <- getPath("exampleDescription")
dsList <- prepareDSList(pathData, pathDescription)
dat <- dsRead(dsList, "glass")
dim(dat)
The function allows string matching against some of the fields
"identification", "fullName", "dirName", "files" of the
structure describing the data sets.
dsSearch(dsList, id, searchField=c("identification", "fullName", "dirName", "files"), searchType=c("exact", "prefix", "suffix", "anywhere"), caseSensitive=FALSE)
dsList |
Data frame as created by |
id |
Character of length one or numeric of length at most |
searchField |
Character. Name of a column in |
searchType |
Character. Type of search. |
caseSensitive |
Logical. Whether the search should be case sensitive. |
The parameter searchField determines which column of dsList is searched;
the parameters searchType and caseSensitive influence the type of search.
These three parameters are ignored if id is numeric.
Regular expressions are not used. Matching with searchType="exact"
is done with ==, searchType="prefix" and searchType="suffix"
are implemented using substr(), and searchType="anywhere" is
implemented using grep(, fixed=TRUE).
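The four search types named above can be sketched in base R with the same primitives (==, substr() and grep(, fixed=TRUE)). This is an illustration, not the package's code: matchField is a hypothetical helper, the public names in the demonstration vector are made up, and implementing case insensitivity with tolower() is an assumption.

```r
# Hypothetical sketch of literal (non-regex) matching in four modes.
matchField <- function(values, pattern,
                       type = c("exact", "prefix", "suffix", "anywhere"),
                       caseSensitive = FALSE) {
  type <- match.arg(type)
  if (!caseSensitive) {
    # Assumption: case is ignored by lowercasing both sides.
    values <- tolower(values)
    pattern <- tolower(pattern)
  }
  n <- nchar(pattern)
  len <- nchar(values)
  switch(type,
    exact = which(values == pattern),
    prefix = which(substr(values, 1, n) == pattern),
    suffix = which(substr(values, len - n + 1, len) == pattern),
    anywhere = grep(pattern, values, fixed = TRUE)
  )
}

# Made-up public names, used only for this demonstration.
fullNames <- c("Glass Identification", "Annealing", "Balance Scale")
print(matchField(fullNames, "ident", "anywhere"))
print(matchField(fullNames, "glass", "prefix"))
print(matchField(fullNames, "scale", "suffix"))
# "." is matched literally, so it finds nothing here.
print(matchField(fullNames, ".", "anywhere"))
```

Because grep() is called with fixed=TRUE, characters such as "." or "*" in the search string have no special meaning.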
Data frame containing the indices and identification of the matching data sets and the value of the search field, if applicable.
Petr Savicky
pathData <- getPath("exampleData")
pathDescription <- getPath("exampleDescription")
dsList <- prepareDSList(pathData, pathDescription)
dsSearch(dsList, "ident", searchField="fullName", searchType="anywhere")
Sorts the rows of a data frame lexicographically. This makes it possible to compare two data sets as sets of cases, disregarding their order.
dsSort(dat)
dat |
A data frame. |
The function calls order() with the columns of dat as the
sorting criteria.
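Sorting by all columns with order() can be sketched in base R as follows. The helper sortRows is hypothetical and only illustrates the idea of comparing two data frames as sets of cases; it is not the source of dsSort().

```r
# Hypothetical sketch: order rows using every column as a sorting key,
# first column first. unname() avoids clashes with argument names of order().
sortRows <- function(dat) {
  ord <- do.call(order, unname(as.list(dat)))
  dat[ord, , drop = FALSE]
}

casesA <- data.frame(x = c(2, 1, 1), y = c("b", "b", "a"),
                     stringsAsFactors = FALSE)
casesB <- casesA[c(3, 1, 2), ]  # same cases in a different order

sortedA <- sortRows(casesA)
sortedB <- sortRows(casesB)
# Row names record the original positions, so drop them before comparing.
rownames(sortedA) <- NULL
rownames(sortedB) <- NULL
print(isTRUE(all.equal(sortedA, sortedB)))
```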
A data frame whose rows are reordered by the sorting.
Petr Savicky
pathData <- getPath("exampleData")
pathDescription <- getPath("exampleDescription")
dsList <- prepareDSList(pathData, pathDescription)
dat <- dsRead(dsList, "glass")
sorted <- dsSort(dat)
Checks whether all the files of a specified data set are accessible in a local directory.
getAvailable(dsList, id=NULL, asLogical=FALSE)
dsList |
Data frame as created by |
id |
Character or numeric vector. A character vector should contain
names matching the names |
asLogical |
Logical, whether the output should be a logical
vector of the same length as |
The test is not completely reliable, since it only verifies that
files with the required file names are accessible. If the
files require some transformations after download and these
have not been performed, the data set is still reported as available.
The test uses the file names specified in the contents.xml file.
If these names differ by mistake from the files actually
read in the reading scripts, then the test may also yield an
incorrect result.
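The limitation described above follows from the nature of a name-based check. The base R sketch below illustrates it under stated assumptions: allFilesPresent is a hypothetical helper, the file names are made up, and the real function may locate the files differently.

```r
# Hypothetical sketch: a data set counts as available when every listed
# file name exists in the data directory. File contents are not inspected.
allFilesPresent <- function(dataDir, files) {
  all(file.exists(file.path(dataDir, files)))
}

availDir <- file.path(tempdir(), "demoData")
dir.create(availDir, showWarnings = FALSE)
# An empty file is enough to pass the check, which is exactly the caveat.
file.create(file.path(availDir, "glass.data"))

print(allFilesPresent(availDir, "glass.data"))
print(allFilesPresent(availDir, c("glass.data", "glass.names")))
```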
A logical vector of length length(id) specifying, for
each component of id, the result of the check, or a character
vector containing the identification of the available data sets.
Petr Savicky
pathData <- getPath("exampleData")
pathDescription <- getPath("exampleDescription")
dsList <- prepareDSList(pathData, pathDescription)
getAvailable(dsList)
The data frame dsList contains the names of the data sets, the names
of the directories, the files which belong to each of the data sets,
and some other information. The function returns a table describing the
fields and their usage.
getFields()
Table containing the names, types and usage of the fields expected
in dsList.
Petr Savicky
pathData <- getPath("exampleData")
pathDescription <- getPath("exampleDescription")
dsList <- prepareDSList(pathData, pathDescription)
getFields()
Appends the path to the directory of an installed package and a name of its subdirectory.
getPath(dirName)
dirName |
Character. Name of the example subdirectory of
an installed package. This is currently |
Character string, which is a full path to the required example directory in an installed package.
Petr Savicky
The type information is derived from the contents of individual columns of an input data frame.
getType(dat)
dat |
A data frame. |
A character vector of length ncol(dat) containing "n" for numerical
columns, the number of different values for character or factor columns,
and "o" otherwise.
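The encoding described above can be reproduced in base R for illustration. The helper columnTypes is hypothetical and is not the source of getType(); how the real function counts missing values is not stated here, so this sketch simply counts them as a value via unique().

```r
# Hypothetical sketch: "n" for numeric columns, the count of distinct
# values (as a string) for character or factor columns, "o" otherwise.
columnTypes <- function(dat) {
  vapply(dat, function(x) {
    if (is.numeric(x)) {
      "n"
    } else if (is.character(x) || is.factor(x)) {
      as.character(length(unique(x)))
    } else {
      "o"
    }
  }, character(1), USE.NAMES = FALSE)
}

typeDemo <- data.frame(a = c(1, 2, 3), b = c("u", "v", "u"),
                       c = c(TRUE, FALSE, TRUE), stringsAsFactors = FALSE)
print(columnTypes(typeDemo))
```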
Petr Savicky
pathData <- getPath("exampleData")
pathDescription <- getPath("exampleDescription")
dsList <- prepareDSList(pathData, pathDescription)
dat <- dsRead(dsList, "annealing")
getType(dat)
Creates the data frame dsList, which describes the data contained
in a local data description directory. The data frame dsList
is needed to read the data contained in
a directory tree below dsList$pathData using dsRead().
The directory pathDescription is expected to contain the
file contents.xml and a subdirectory scripts with R scripts
for reading the data sets.
prepareDSList(pathData, pathDescription)
pathData |
Character. A path to the required data directory. |
pathDescription |
Character. A path to a directory containing
description of the required data, in particular the file |
The character "~" expands to your home directory.
The directory pathData need not contain all the data sets
included in pathDescription/contents.xml. The function
getAvailable() is called and its output is stored
in the column availability of the output data frame, which is
logical and specifies for each data set whether or not it is
present.
See http://www.cs.cas.cz/~savicky/readMLData/ for description
files of some of the data sets from the UCI Machine Learning Repository.
See the help page readMLData for more information
on the structure of the description files.
Data frame with columns pathData, pathDescription,
and others as listed by getFields(). The output data frame
can be used as the dsList parameter of the functions dsSearch(),
dsRead(), checkConsistency(), checkType().
Petr Savicky
readMLData, getAvailable, checkConsistency.
pathData <- getPath("exampleData")
pathDescription <- getPath("exampleDescription")
dsList <- prepareDSList(pathData, pathDescription)
Input and output of a data set description from and to an XML file. These functions
are not intended for direct use by the user for reading the data sets. The
function readDSListFromXML() is called from prepareDataDir().
The function saveDSListAsXML() is used for preparing the file
contents.xml in the data set description directory.
readDSListFromXML(filename)
saveDSListAsXML(dsList, filename)
dsList |
A data frame created by |
filename |
The name of an XML file to be used. |
saveDSListAsXML() returns the filename of the created file.
readDSListFromXML() returns a data frame with the description of the data sets.
Petr Savicky