Title: | Manage and manipulate large collections of R objects stored as tape-like files |
---|---|
Description: | Storing huge data in RData format causes problems because the whole file must be loaded into memory in order to access and manipulate the objects inside it; rtape is a simple solution to this problem. The package contains several wrappers of R's built-in serialize/unserialize mechanism, allowing the user to quickly append objects to a tape-like file and later iterate over them while requiring only one copy of each stored object to reside in memory at a time. |
Authors: | Miron B. Kursa |
Maintainer: | Miron B. Kursa <[email protected]> |
License: | GPL-2 |
Version: | 2.2 |
Built: | 2024-12-03 06:58:20 UTC |
Source: | CRAN |
Automatically pick proper tape file format.
guessFileFormat(fName)
fName |
Name of the file whose format is to be guessed; if the file does not exist, the function returns the default file format. |
This function guesses the tape file format from the file header, or assumes the default (gzip) if given the name of a non-existent file. The main package functions use this routine to set up the file format automatically; if you really need to control it, see makeFileFormat.
The function to be passed to the fileFormat* arguments of other rtape functions.
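The header-based guessing described above can be illustrated with base R alone. The sketch below is not rtape's implementation: `sniffCompression` is a hypothetical helper, and the labels it returns ("gz", "bz", "xz", "unknown") are assumptions chosen for illustration. It only relies on the well-known magic bytes that gzip, bzip2 and xz streams start with, and falls back to a default when the file does not exist.

```r
# Hypothetical helper (not part of rtape): guess the compression of a file
# from its leading "magic" bytes, falling back to a default for missing files.
sniffCompression <- function(fName, default = "gz") {
  if (!file.exists(fName)) return(default)
  magic <- readBin(fName, what = "raw", n = 6L)
  # TRUE when the file starts with the given byte signature
  hasPrefix <- function(sig) {
    length(magic) >= length(sig) && all(magic[seq_along(sig)] == sig)
  }
  if (hasPrefix(as.raw(c(0x1f, 0x8b)))) return("gz")   # gzip
  if (hasPrefix(charToRaw("BZh"))) return("bz")        # bzip2
  if (hasPrefix(as.raw(c(0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00)))) return("xz")
  "unknown"
}

# Usage: write one object through each of base R's compressed connections,
# then guess the format back from the bytes on disk.
paths <- c(gz = tempfile(), bz = tempfile(), xz = tempfile())
openers <- list(gz = gzfile, bz = bzfile, xz = xzfile)
for (fmt in names(paths)) {
  con <- openers[[fmt]](paths[[fmt]], open = "wb")
  serialize(1:3, con)
  close(con)
}
guessed <- vapply(paths, sniffCompression, character(1))
missingGuess <- sniffCompression(tempfile())  # this file does not exist
unlink(paths)
```

Each element of `guessed` should match the name of the connection that wrote the file, and `missingGuess` should be the default.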
Miron B. Kursa [email protected]
Setting tape file format/compression.
makeFileFormat(compression="gz", compressionLevel=ifelse(compression == "bz", 9, 6))
compression |
Name of the compression algorithm; should be one of the |
compressionLevel |
rtape uses R connections to store data; this function creates a function that the other rtape functions use to create a connection. Changing its parameters allows an advanced user to change the compression format/level and thus control the speed/file size trade-off. The default values (gzip, compression level 6) should give performance similar to that of save.
The function to be passed to the fileFormat* arguments of other rtape functions.
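The idea of "a function that creates a function that opens a connection" may be easier to see in code. The following is a minimal sketch using only base R connections; `makeOpener` and the signature of the function it returns are hypothetical, and the set of compression names is an assumption rather than rtape's documented list.

```r
# Hypothetical factory (not rtape's makeFileFormat): returns a function that
# opens a connection to a file with the chosen compression settings baked in.
makeOpener <- function(compression = c("gz", "bz", "xz"), compressionLevel = 6) {
  compression <- match.arg(compression)
  function(fName, open) {
    switch(compression,
      gz = gzfile(fName, open = open, compression = compressionLevel),
      bz = bzfile(fName, open = open, compression = compressionLevel),
      xz = xzfile(fName, open = open, compression = compressionLevel)
    )
  }
}

# Usage: the caller only sees "a function that opens the tape".
openTape <- makeOpener("bz", compressionLevel = 9)
path <- tempfile()
con <- openTape(path, open = "wb")
serialize(list(a = 1, b = "two"), con)
close(con)

con <- openTape(path, open = "rb")
restored <- unserialize(con)
close(con)
unlink(path)
```

Because the compression settings live in the closure, code that writes or reads the file does not need to know which format was chosen.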
Effectively, this function is needed only to set up the format of a new, blank tape (i.e. in the first call to rtapeAdd, or when altering compression as part of the tape reconstruction performed by rtapeRerecord or rtapeFilter); when dealing with already existing tapes, guessFileFormat will recognise the right format from the file header.
Miron B. Kursa [email protected]
Iterate over tape, discarding results.
rtape_apply(fNames, FUN, ...)
fNames |
Name of the tape file to read; if this argument is a vector of several names, the function behaves as if reading a single tape made of all those tapes joined in the given order. |
FUN |
Callback function. |
... |
Additional parameters to |
This function reads the tape from the oldest to the newest write and executes the callback function on each object read. Logically, it is equivalent to ignore<-lapply(rtapeAsList(fNames),FUN,...), but it is optimized to keep only the currently processed object in memory and to discard FUN results as soon as they appear.
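To show why streaming keeps memory use low, here is a base R sketch of the append-then-iterate pattern. It is an illustration under stated assumptions, not rtape's code: `tapeAppend` and `tapeApply` are hypothetical names, the tape is an uncompressed file() connection for simplicity, and the end of the tape is detected by treating a failed unserialize() as end of data (which would also hide a corrupt tape).

```r
# Hypothetical helper: append one serialized object to the end of a file.
tapeAppend <- function(fName, what) {
  con <- file(fName, open = "ab")
  on.exit(close(con))
  serialize(what, con)
  invisible(NULL)
}

# Hypothetical helper: call FUN on each stored object, oldest first,
# holding only the current object in memory and discarding FUN's result.
tapeApply <- function(fName, FUN, ...) {
  con <- file(fName, open = "rb")
  on.exit(close(con))
  endOfTape <- new.env()  # unique sentinel, distinct from any stored object
  repeat {
    # Assumption: a read error means there is nothing more to read
    obj <- tryCatch(unserialize(con), error = function(e) endOfTape)
    if (identical(obj, endOfTape)) break
    FUN(obj, ...)
  }
  invisible(NULL)
}

# Usage: store three objects (including a NULL) and visit each of them once.
tape <- tempfile(fileext = ".tape")
tapeAppend(tape, 1:3)
tapeAppend(tape, NULL)
tapeAppend(tape, letters[1:2])
seen <- 0L
total <- 0L
tapeApply(tape, function(x) {
  seen <<- seen + 1L
  total <<- total + length(x)
})
unlink(tape)
```

The sentinel environment lets a stored NULL pass through the loop without being mistaken for the end of the tape.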
Miron B. Kursa [email protected]
unlink('tmp.tape')
#Record something on the tape
rtapeAdd('tmp.tape',runif(3))
rtapeAdd('tmp.tape',rnorm(3))
#Print tape contents
rtape_apply('tmp.tape',print)
unlink('tmp.tape')
Add object to the tape.
rtapeAdd(fName, what, skipNULLs=FALSE, fileFormat=guessFileFormat(fName), safe=FALSE, retryTime=0.1)
fName |
Name of the tape file. |
what |
Object to be stored. |
skipNULLs |
If true and |
fileFormat |
File format; should be left default. See |
safe |
If |
retryTime |
If |
This function serializes a given object and appends it to the end of the tape file. If the tape file does not exist, it is created.
Remember to use the same fileFormat value for all writes to a given tape (or use the default format guesser to guarantee this); otherwise the tape will become unreadable. For the same reason, don't try to put custom headers/footers or append other data inside the tape stream.
This function is thread/process safe only if you use safe mode. However, in this mode it may jam on a broken dirlock (for instance when the locking R process crashed during a write); you can fix this manually by removing the locking directory. Its name is always .rtape_<tape file name>_lock. Waiting in retry mode is performed via Sys.sleep, so it is not a busy wait.
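The directory-lock idea mentioned above can be sketched in base R as follows. This is an illustration only: `withDirLock` is a hypothetical helper, not rtape's implementation, and placing the lock directory next to the tape file is an assumption (only the lock name pattern comes from the text). The sketch also shows why a stale lock jams the writer: the loop keeps sleeping for as long as the directory exists.

```r
# Hypothetical helper (not rtape's code): run `code` while holding a lock
# represented by a directory. Creating a directory either succeeds or fails
# as a whole, so only one process can acquire the lock at a time.
withDirLock <- function(lockDir, code, retryTime = 0.1) {
  # dir.create() returns FALSE when the directory could not be created,
  # e.g. because another process already holds the lock
  while (!dir.create(lockDir, showWarnings = FALSE)) {
    Sys.sleep(retryTime)  # sleep between attempts instead of busy-waiting
  }
  on.exit(unlink(lockDir, recursive = TRUE))  # release the lock on exit
  code()
}

tape <- tempfile(fileext = ".tape")
# Lock name pattern taken from the text; its location is an assumption.
lockDir <- file.path(dirname(tape), paste0(".rtape_", basename(tape), "_lock"))

# Usage: the lock directory exists while the code runs and is gone afterwards.
heldDuring <- withDirLock(lockDir, function() dir.exists(lockDir))
heldAfter <- dir.exists(lockDir)
```

If a process crashes between acquiring and releasing the lock, the directory stays behind and has to be removed by hand, which matches the manual fix described above.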
Miron B. Kursa [email protected]
unlink('tmp.tape')
#Record something on the tape
rtapeAdd('tmp.tape',iris)
rtapeAdd('tmp.tape',sin(1:10))
#Read whole tape to the list, so we could examine it
rtapeAsList('tmp.tape')->stored
print(stored)
unlink('tmp.tape')
Load the whole tape as a list.
rtapeAsList(fNames)
fNames |
Name of the tape file to read; if this argument is a vector of several names, the function behaves as if reading a single tape made of all those tapes joined in the given order. |
This function reads all the objects from the tape, in the order they were written to it, and returns them as a list.
A list containing all the objects stored on the tape.
Miron B. Kursa [email protected]
unlink('tmp.tape')
#Record something on the tape
rtapeAdd('tmp.tape',iris)
rtapeAdd('tmp.tape',sin(1:10))
#Read whole tape to the list, so we could examine it
rtapeAsList('tmp.tape')->stored
print(stored)
unlink('tmp.tape')
Rerecord the tape dropping certain objects.
rtapeFilter(FUN, fNamesIn, fNameOut=fNamesIn, moreArgs, fileFormatOut=guessFileFormat(fNameOut))
FUN |
Callback function which gets the current object and returns a boolean value that directs |
fNamesIn |
Name of the tape file to read; if this argument is a vector of several names, the function behaves as if reading a single tape made of all those tapes joined in the given order. |
fNameOut |
Name of the tape in which to store the output of filtering; if this file is one of the input files, it is overwritten with the output; otherwise the output is appended to this tape. This must be a one-element vector. |
moreArgs |
Additional arguments to |
fileFormatOut |
File format; should be left default. See |
This function reads the objects from one tape, executes a callback function on each of them, and keeps on that tape (or appends to another tape) only those for which the callback returned TRUE.
Overwriting is NOT done in place; instead, a temporary file is created and then used to overwrite the tape being filtered.
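The "write to a temporary file, then overwrite" approach can be sketched with base R. This is not rtape's implementation: `filterTape` is a hypothetical helper, the tape is an uncompressed file() connection for simplicity, and a failed unserialize() is assumed to mean the end of the tape.

```r
# Hypothetical helper: keep only the objects for which keep() returns TRUE,
# streaming through a temporary file instead of editing the tape in place.
filterTape <- function(keep, fName) {
  tmp <- tempfile(tmpdir = dirname(fName))
  inCon <- file(fName, open = "rb")
  outCon <- file(tmp, open = "wb")
  endOfTape <- new.env()  # unique sentinel, distinct from any stored object
  repeat {
    # Assumption: a read error means there is nothing more to read
    obj <- tryCatch(unserialize(inCon), error = function(e) endOfTape)
    if (identical(obj, endOfTape)) break
    if (isTRUE(keep(obj))) serialize(obj, outCon)
  }
  close(inCon)
  close(outCon)
  # Replace the original only after the filtered copy is complete
  file.copy(tmp, fName, overwrite = TRUE)
  unlink(tmp)
  invisible(NULL)
}

# Usage: store 1..10, then drop the even numbers.
tape <- tempfile(fileext = ".tape")
con <- file(tape, open = "wb")
for (i in 1:10) serialize(i, con)
close(con)

filterTape(function(x) x %% 2 == 1, tape)

# Read back the five surviving objects in order
con <- file(tape, open = "rb")
kept <- vapply(1:5, function(i) unserialize(con), integer(1))
close(con)
unlink(tape)
```

Writing the complete result first means an error in the callback leaves the original tape untouched, at the cost of temporarily needing disk space for both files.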
Miron B. Kursa [email protected]
unlink(c('tmp.tape'))
#Record something
for(i in 1:10) rtapeAdd('tmp.tape',i)
#Discard even numbers
rtapeFilter(function(x) (x%%2)==1,'tmp.tape')
#Check it out
unlist(rtapeAsList('tmp.tape'))->A
print(A); stopifnot(all(A==c(1,3,5,7,9)))
#Time to clean up
unlink(c('tmp.tape'))
Iterate over tape, gathering results.
rtapeLapply(fNames, FUN, ...)
fNames |
Name of the tape file to read; if this argument is a vector of several names, the function behaves as if reading a single tape made of all those tapes joined in the given order. |
FUN |
Callback function. |
... |
Additional parameters to |
This function reads the tape from the oldest to the newest write and executes the callback function on each object read. Logically, it is equivalent to lapply(rtapeAsList(fNames),FUN,...), but it is optimized to keep only the currently processed object in memory.
A list containing the results of the FUN calls.
Miron B. Kursa [email protected]
unlink('tmp.tape')
#Record something on the tape
rtapeAdd('tmp.tape',runif(3))
rtapeAdd('tmp.tape',rnorm(3))
#Print tape contents
rtape_apply('tmp.tape',print)
unlink('tmp.tape')
Rerecord the tape.
rtapeRerecord(FUN, fNamesIn, fNameOut=fNamesIn, moreArgs, skipNULLs=FALSE, fileFormatOut=guessFileFormat(fNameOut))
FUN |
Callback function which transforms the objects. |
fNamesIn |
Name of the tape file to read; if this argument is a vector of several names, the function behaves as if reading a single tape made of all those tapes joined in the given order. |
fNameOut |
Name of the tape in which to store the rerecorded output; if this file is one of the input files, it is overwritten with the output; otherwise the output is appended to this tape. This must be a one-element vector. |
moreArgs |
Additional arguments to |
skipNULLs |
If true, all the |
fileFormatOut |
File format; should be left default. See |
This function reads the objects from one tape, executes a callback function on each of them, and either replaces them with the objects the callback returned or appends those objects to another tape.
Overwriting is NOT done in place; instead, a temporary file is created and then used to overwrite the tape being rerecorded.
Miron B. Kursa [email protected]
unlink(c('tmp.tape','tmp2.tape'))
#Record something
for(i in 1:10) rtapeAdd('tmp.tape',i)
#Multiply each object by two
rtapeRerecord('*','tmp.tape','tmp2.tape',moreArgs=list(2))
#Check it out
unlist(rtapeAsList('tmp.tape'))->A
B<-unlist(rtapeAsList('tmp2.tape'))
print(A);print(B)
stopifnot(all(A==B/2))
#Now do the same in-place:
rtapeRerecord('*','tmp.tape',moreArgs=list(2))
unlist(rtapeAsList('tmp.tape'))->B2
stopifnot(all(A==B2/2))
#Time to clean up
unlink(c('tmp.tape','tmp2.tape'))