In GRP.default()
, the "group.starts"
attribute is always returned, even if there is only one group or every observation is its own group. Thanks @JamesThompsonC (#631).
Fixed a bug in pivot()
if na.rm = TRUE
and how = "wider"|"recast"
and there are multiple value
columns with different missingness patterns. In this case na_omit(values)
was applied with default settings to the original (long) value columns, implying potential loss of information. The fix applies na_omit(values, prop = 1)
, i.e., only removes completely missing rows.
qDF()/qDT()/qTBL()
now allow a length-2 vector of names to row.names.col
if X
is a named atomic vector, e.g., qDF(fmean(mtcars), c("cars", "mean"))
gives the same as pivot(fmean(mtcars, drop = FALSE), names = list("car", "mean"))
.
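A minimal sketch of the length-2 row.names.col usage, assuming a collapse version that includes this change. The vector x and the column names "name" and "value" are made up for illustration.

```r
library(collapse)

# A named atomic vector: names should go to the first column, values to the second
x <- c(a = 1.5, b = 2.5, c = 3.5)

# The second positional argument is row.names.col; here it carries both column names
df <- qDF(x, c("name", "value"))
print(df)
```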
Added a subsection on using internal (ad-hoc) grouping to the collapse for tidyverse users vignette.
qsu()
now adds a WeightSum
column giving the sum of (non-zero or missing) weights if the w
argument is used. Thanks @mayer79 for suggesting (#650). For panel data (pid
) the 'Between' sum of weights is also simply the number of groups, and the 'Within' sum of weights is the 'Overall' sum of weights divided by the number of groups.
Fixed an inaccuracy in fquantile()/fnth()
with weights: As per documentation the target sum is sumwp = (sum(w) - min(w)) * p
, however, in practice, the weight of the minimum element of x
was used instead of the minimum weight. Since the smallest element in the sample usually has a small weight, this went unnoticed for a long while; it has now been reported and fixed thanks to @Jahnic-kb (#659).
Fixed a bug in recode_char()
when regex = TRUE
and the default
argument was used. Thanks @alinacherkas for both reporting and fixing (#654).
Fixes an installation bug on some Linux systems (conflicting types) (#613).
collapse now enforces string encoding in fmatch()
/ join()
, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, 25th, 50th and 75th percentile and last string of a character vector are checked, and if not UTF8, the entire vector is internally coerced to UTF8 strings before the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, collapse assumes that character vectors are uniform in terms of string encoding. Heterogeneous strings should be coerced using tools like stringi::stri_trans_general(x, "latin-ascii")
.
Fixes a bug using qualified names for fast statistical functions inside across()
(#621, thanks @alinacherkas).
collapse now depends on R >= 3.4.0 due to the enforcement of STRICT_R_HEADERS = 1
from R v4.5.0. In particular R API functions were renamed Calloc -> R_Calloc
and Free -> R_Free
.
Some changes on the C-side to move the package closer to C API compliance (demanded by R-Core). One notable change is that gsplit()
no longer supports S4 objects (because SET_S4_OBJECT
is not part of the API and asS4()
is too expensive for tight loops). I cannot think of a single example where it would be necessary to split an S4 object, but if you do have applications please file an issue.
pivot()
has new arguments FUN = "last"
and FUN.args = NULL
, allowing wide and recast pivots with aggregation (default last value as before). FUN
currently supports a single function returning a scalar value. Fast Statistical Functions receive vectorized execution. FUN.args
can be used to supply a list of function arguments, including data-length arguments such as weights. There are also a couple of internal functions callable using function strings: "first"
, "last"
, "count"
, "sum"
, "mean"
, "min"
, or "max"
. These are built into the reshaping C-code and thus extremely fast. Thanks @AdrianAntico for the request (#582).
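A small sketch of a wide pivot with aggregation, assuming a collapse version that includes the FUN argument. The data frame long and its column names are hypothetical; it contains duplicate id/var combinations so that the choice of FUN matters.

```r
library(collapse)

# Hypothetical long data with duplicated (id, var) combinations
long <- data.frame(
  id  = c(1, 1, 2, 2, 2),
  var = c("a", "a", "a", "b", "b"),
  val = c(1, 2, 3, 4, 5)
)

# Default behaviour: the last value per (id, var) combination is kept
wide_last <- pivot(long, ids = "id", names = "var", values = "val", how = "wider")

# Aggregating duplicates with the internal "sum" function instead
wide_sum <- pivot(long, ids = "id", names = "var", values = "val",
                  how = "wider", FUN = "sum")

print(wide_last)
print(wide_sum)
```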
join()
now provides enhanced verbosity, indicating the average order of the join between the two tables, e.g.
join(data.frame(id = c(1, 2, 2, 4)), data.frame(id = c(rep(1,4), 2:3)))
#> left join: x[id] 3/4 (75%) <1.5:1st> y[id] 2/6 (33.3%)
#> id
#> 1 1
#> 2 2
#> 3 2
#> 4 4
join(data.frame(id = c(1, 2, 2, 4)), data.frame(id = c(rep(1,4), 2:3)), multiple = TRUE)
#> left join: x[id] 3/4 (75%) <1.5:2.5> y[id] 5/6 (83.3%)
#> id
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 2
#> 6 2
#> 7 4
In collap()
, with multiple functions passed to FUN
or catFUN
and return = "long"
, the "Function"
column is now generated as a factor variable instead of character (which is more efficient).
Updated 'collapse and sf' vignette to reflect the recent support for units objects, and added a few more examples.
Fixed a bug in join()
where a full join silently became a left join if there are no matches between the tables (#574). Thanks @D3SL for reporting.
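The fixed case can be checked with two small tables that share no keys. This is only an illustrative sketch with made-up data; after the fix, the full join should retain the rows of both tables.

```r
library(collapse)

# Two tables whose id values do not overlap at all
x <- data.frame(id = 1:2, a = c("x", "y"))
y <- data.frame(id = 3:4, b = c("u", "v"))

# Full join: all rows of x and all rows of y are expected in the result
full <- join(x, y, on = "id", how = "full", verbose = 0)
print(full)
```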
Added function group_by_vars()
: A standard evaluation version of fgroup_by()
that is slimmer and safer for programming, e.g. data |> group_by_vars(ind1) |> collapg(custom = list(fmean = ind2, fsum = ind3))
. Or, using magrittr:
library(magrittr)
set_collapse(mask = "manip") # for fgroup_vars -> group_vars
data %>%
  group_by_vars(ind1) %>% {
    add_vars(
      group_vars(., "unique"),
      get_vars(., ind2) %>% fmean(keep.g = FALSE) %>% add_stub("mean_"),
      get_vars(., ind3) %>% fsum(keep.g = FALSE) %>% add_stub("sum_")
    )
  }
Added function as_integer_factor()
to turn factors/factor columns into integer vectors. as_numeric_factor()
already exists, but is memory inefficient for most factors where levels can be integers.
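A brief sketch contrasting the two converters on a factor whose levels are whole numbers; the factor f is made up for illustration.

```r
library(collapse)

# A factor whose levels are integers stored as strings
f <- factor(c("10", "20", "10", "30"))

int_vals <- as_integer_factor(f)  # integer storage
num_vals <- as_numeric_factor(f)  # double storage, the pre-existing function

print(typeof(int_vals))
print(typeof(num_vals))
```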
join()
now internally checks if the rows of the joined datasets match exactly. This check, using identical(m, seq_row(y))
, is inexpensive, but, if TRUE
, saves a full subset and deep copy of y
. Thus join()
now inherits the intelligence already present in functions like fsubset()
, roworder()
and funique()
- a key for efficient data manipulation is simply doing less.
In join()
, if attr = TRUE
, the count
option to fmatch()
is always invoked, so that the attribute attached always has the same form, regardless of verbose
or validate
settings.
roworder[v]()
has optional setting verbose = 2L
to indicate if x
is already sorted, making the call to roworder[v]()
redundant.
collapse now explicitly supports xts/zoo and units objects and concurrently removes an additional check in the .default
method of statistical functions that called the matrix method if is.matrix(x) && !inherits(x, "matrix")
. This was a smart solution to account for the fact that xts objects are matrix-based but don't inherit the "matrix"
class, thus wrongly calling the default method. The same is the case for units, but here, my recent more intensive engagement with spatial data convinced me that this should be changed. For one, under the previous heuristic solution, it was not possible to call the default method on a units matrix, e.g., fmean.default(st_distance(points_sf))
called fmean.matrix()
and yielded a vector. This should not be the case. Secondly, aggregation e.g. fmean(st_distance(points_sf))
or fmean(st_distance(points_sf), g = group_vec)
yielded a plain numeric object that lost the units class (in line with the general attribute handling principles). Therefore, I have now decided to remove the heuristic check within the default methods, and explicitly support zoo and units objects. For Fast Statistical Functions, the methods are FUN.zoo <- function(x, ...) if(is.matrix(x)) FUN.matrix(x, ...) else FUN.default(x, ...)
and FUN.units <- function(x, ...) if(is.matrix(x)) copyMostAttrib(FUN.matrix(x, ...), x) else FUN.default(x, ...)
. While the behavior for xts/zoo remains the same, the behavior for units is enhanced, as now the class is preserved in aggregations (the .default
method preserves attributes except for ts), and it is possible to manually invoke the .default
method on a units matrix and obtain an aggregate statistic. This change may impact computations on other matrix based classes which don't inherit from "matrix"
(mts does inherit from "matrix"
, and I am not aware of any other affected classes, but user code like m <- matrix(rnorm(25), 5); class(m) <- "bla"; fmean(m)
will now yield a scalar instead of a vector. Such code must be adjusted to either class(m) <- c("bla", "matrix")
or fmean.matrix(m)
). Overall, the change makes collapse behave in a more standard and predictable way, and enhances its support for units objects central in the sf ecosystem.
fquantile()
now also preserves the attributes of the input, in line with quantile()
.
An article on collapse has been submitted to the Journal of Statistical Software. The preprint is available through arXiv.
Removed magrittr from most documentation examples (using base pipe).
Improved plot.GRP
a little bit - on request of JSS editors.
Fixed a bug in fmatch()
when matching integer vectors to factors. This also affected join()
.
Improved cross-platform compatibility of OpenMP flags. Thanks @kalibera.
Added stub = TRUE
argument to the grouped_df methods of Fast Statistical Functions supporting weights, to be able to remove or alter prefixes given to aggregated weights columns if keep.w = TRUE
. Globally, users can set st_collapse(stub = FALSE)
to disable this prefixing in all statistical functions and operators.
set_collapse()
also supports options 'digits', 'verbose' and 'stable.algo', enhancing the global configurability of collapse.
qM()
now also has a row.names.col
argument in the second position allowing generation of rownames when converting data frame-like objects to matrix e.g. qM(iris, "Species")
or qM(GGDC10S, 1:5)
(interaction of id's).
as_factor_GRP()
and finteraction()
now have an argument sep = "."
denoting the separator used for compound factor labels.
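A short sketch of the sep argument with two made-up character vectors; the default separator is "." and "_" is used here only for illustration.

```r
library(collapse)

g1 <- c("a", "b", "a")
g2 <- c("x", "y", "x")

# Compound factor labels joined with "_" instead of the default "."
f <- finteraction(g1, g2, sep = "_")
print(as.character(f))
```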
alloc()
now has an additional argument simplify = TRUE
. FALSE
always returns list output.
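A minimal sketch of the difference between the two settings for a length-1 atomic value:

```r
library(collapse)

# Default simplify = TRUE: a length-1 atomic value gives an atomic vector
v <- alloc(0, 3)

# simplify = FALSE: the same call gives a list of length 3
l <- alloc(0, 3, simplify = FALSE)

str(v)
str(l)
```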
frename()
supports both new = old
(pandas, used so far) and old = new
(dplyr) style renaming conventions.
across()
supports negative indices, also in grouped settings: these will select all variables apart from grouping variables.
TRA()
allows shorthands "NA"
for "replace_NA"
and "fill"
for "replace_fill"
.
group()
experienced a minor speedup with >= 2 vectors as the first two vectors are now hashed jointly.
fquantile()
with names = TRUE
adds up to 1 digit after the comma in the percent-names, e.g. fquantile(airmiles, probs = 0.001)
generates appropriate names (not 0% as in the previous version).
New vignette on collapse's Handling of R Objects: provides an overview of collapse’s (internal) class-agnostic R programming framework.
print.descr()
with groups and option perc = TRUE
(the default) also shows percentages of the group frequencies for each variable.
funique(mtcars[NULL, ], sort = TRUE)
gave an error (for data frame with zero rows). Thanks @NicChr (#406).
Added SIMD vectorization for fsubset()
.
vlengths()
now also works for strings, and is hence a much faster version of both lengths()
and nchar()
. Also for atomic vectors the behavior is like lengths()
, e.g. vlengths(rnorm(10))
gives rep(1L, 10)
.
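A small sketch of the three cases described above (list, character vector, other atomic vector); the inputs are made up for illustration.

```r
library(collapse)

# List input: behaves like lengths()
len_list <- vlengths(list(1:3, "x", NULL))

# Character input: behaves like nchar()
len_chr <- vlengths(c("a", "bcd", ""))

# Other atomic input: every element has length 1
len_num <- vlengths(c(2.5, 3.5))

print(len_list)
print(len_chr)
print(len_num)
```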
In collap[v/g]()
, the ...
argument is now placed after the custom
argument instead of after the last argument, in order to better guard against unwanted partial argument matching. In particular, previously the n
argument passed to fnth
was partially matched to na.last
. Thanks @ummel for alerting me of this (#421).
Using DATAPTR_RO
to point to R lists because of the use of ALTLISTS
on R-devel.
Replacing !=
loop controls for SIMD loops with <
to ensure compatibility on all platforms. Thanks @albertus82 (#399).
Improvements in get_elem()/has_elem()
: Option invert = TRUE
is implemented more robustly, and a function passed to get_elem()/has_elem()
is now applied to all elements in the list, including elements that are themselves list-like. This enables the use of inherits
to find list-like objects inside a broader list structure e.g. get_elem(l, inherits, what = "lm")
fetches all linear model objects inside l
.
Fixed a small bug in descr()
introduced in v1.9.0, producing an error if a data frame contained no numeric columns - because an internal function was not defined in that case. Also, POSIXct columns are handled better in print - preserving the time zone (thanks @cdignam-chwy #392).
fmean()
and fsum()
with g = NULL
, as well as TRA()
, setop()
, and related operators %r+%
, %+=%
etc., setv()
and fdist()
now utilize Single Instruction Multiple Data (SIMD) vectorization by default (if OpenMP is enabled), enabling potentially very fast computing speeds. Whether these instructions are utilized during compilation depends on your system. In general, if you want to max out collapse on your system, consider compiling from source with CFLAGS += -O3 -march=native -fopenmp
and CXXFLAGS += -O3 -march=native
in your .R/Makevars
.
Added functions fduplicated()
and any_duplicated()
, for vectors and lists / data frames. Thanks @NicChr (#373)
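A quick sketch of both functions on a vector and on a data frame, using made-up inputs:

```r
library(collapse)

x <- c(1, 2, 1, 3, 2)

dup_x <- fduplicated(x)     # flags repeated occurrences, like duplicated()
any_x <- any_duplicated(x)  # single logical answer

# Data frames are checked row-wise
df <- data.frame(a = c(1, 1, 2), b = c("u", "u", "v"))
dup_df <- fduplicated(df)

print(dup_x)
print(any_x)
print(dup_df)
```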
sort
option added to set_collapse()
to be able to set unordered grouping as a default. E.g. setting set_collapse(sort = FALSE)
will affect collap()
, BY()
, GRP()
, fgroup_by()
, qF()
, qG()
, finteraction()
, qtab()
and internal use of these functions for ad-hoc grouping in fast statistical functions. Other uses of sort
, for example in funique()
where the default is sort = FALSE
, are not affected by the global default setting.
Fixed a small bug in group()
/ funique()
resulting in an unnecessary memory allocation error in rare cases. Thanks @NicChr (#381).
Further fix to an Address Sanitizer issue as required by CRAN (eliminating an unused out of bounds access at the end of a loop).
qsu()
finally has a grouped_df method.
Added options option("collapse_nthreads")
and option("collapse_na.rm")
, which allow you to load collapse with different defaults e.g. through an .Rprofile
or .fastverse
configuration file. Once collapse is loaded, these options take no effect, and users need to use set_collapse()
to change .op[["nthreads"]]
and .op[["na.rm"]]
interactively.
Exported method plot.psmat()
(can be useful to plot time series matrices).
Fixed minor C/C++ issues flagged by CRAN's detailed checks.
Added functions set_collapse()
and get_collapse()
, allowing you to globally set defaults for the nthreads
and na.rm
arguments to all functions in the package. E.g. set_collapse(nthreads = 4, na.rm = FALSE)
could be a suitable setting for larger data without missing values. This is implemented using an internal environment by the name of .op
, such that these defaults are received using e.g. .op[["nthreads"]]
, at the computational cost of a few nanoseconds (8-10x faster than getOption("nthreads")
which would take about 1 microsecond). .op
is not accessible by the user, so function get_collapse()
can be used to retrieve settings. Exempt from this are functions .quantile
, and a new function .range
(alias of frange
), which go directly to C for maximum performance in repeated executions, and are not affected by these global settings. Function descr()
, which internally calls a bunch of statistical functions, is also not affected by these settings.
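A minimal sketch of changing and restoring a global default. It assumes the package defaults (na.rm = TRUE) are active at the start, and it resets the option at the end so that later code is not affected.

```r
library(collapse)

x <- c(1, NA, 3)

default_mean <- fmean(x)            # package default na.rm = TRUE skips the NA

set_collapse(na.rm = FALSE)         # change the global default
strict_mean <- fmean(x)             # the NA now propagates
current <- get_collapse("na.rm")    # retrieve the setting via the exported getter

set_collapse(na.rm = TRUE)          # restore the package default

print(default_mean)
print(strict_mean)
```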
Further improvements in thread safety for fsum()
and fmean()
in grouped computations across data frame columns. All OpenMP enabled functions in collapse can now be considered thread safe i.e. they pass the full battery of tests in multithreaded mode.
collapse 1.9.0, released in mid-January 2023, provides improvements in performance and versatility in many areas, as well as greater statistical capabilities, most notably efficient (grouped, weighted) estimation of sample quantiles.
All functions renamed in collapse 1.6.0 are now deprecated, to be removed end of 2023. These functions had already been giving messages since v1.6.0. See help("collapse-renamed")
.
The lead operator F()
is not exported anymore from the package namespace, to avoid clashes with base::F
flagged by multiple people. The operator is still part of the package and can be accessed using collapse:::F
. I have also added an option "collapse_export_F"
, such that setting options(collapse_export_F = TRUE)
before loading the package exports the operator as before. Thanks @matthewross07 (#100), @edrubin (#194), and @arthurgailes (#347).
Function fnth()
has a new default ties = "q7"
, which gives the same result as quantile(..., type = 7)
(R's default). More details below.
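A quick check of the new default against base R, using the built-in mtcars data. A value of n below 1 is interpreted as a probability.

```r
library(collapse)

x <- mtcars$mpg

q1_fnth <- fnth(x, 0.25)                               # default ties = "q7"
q1_base <- quantile(x, 0.25, type = 7, names = FALSE)  # R's default quantile type

print(c(fnth = q1_fnth, base = q1_base))
```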
fmode()
gave wrong results for singleton groups (groups of size 1) on unsorted data. I had optimized fmode()
for singleton groups to directly return the corresponding element, but it did not access the element through the (internal) ordering vector, so the first element/row of the entire vector/data was taken. The same mistake occurred for fndistinct
if singleton groups were NA
, which were counted as 1
instead of 0
under the na.rm = TRUE
default (provided the first element of the vector/data was not NA
). The mistake did not occur with data sorted by the groups, because here the data pointer already pointed to the first element of the group. (My apologies for this bug, it took me more than half a year to discover it, using collapse on a daily basis, and it escaped 700 unit tests as well).
Function groupid(x, na.skip = TRUE)
returned uninitialized first elements if the first values in x
were NA
. Thanks for reporting @Henrik-P (#335).
Fixed a bug in the .names
argument to across()
. Passing a naming function such as .names = function(c, f) paste0(c, "-", f)
now works as intended i.e. the function is applied to all combinations of columns (c) and functions (f) using outer()
. Previously this was just internally evaluated as .names(cols, funs)
, which did not work if there were multiple cols and multiple funs. There is also now a possibility to set .names = "flip"
, which names columns f_c
instead of c_f
.
fnrow()
was rewritten in C and also supports data frames with 0 columns. Similarly for seq_row()
. Thanks @NicChr (#344).
Added functions fcount()
and fcountv()
: a versatile and blazing fast alternative to dplyr::count
. It also works with vectors, matrices, as well as grouped and indexed data.
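A minimal usage sketch on the built-in mtcars data; by default the count column is named N.

```r
library(collapse)

# Number of cars per cylinder count
counts <- fcount(mtcars, cyl)
print(counts)
```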
Added function fquantile()
: Fast (weighted) continuous quantile estimation (methods 5-9 following Hyndman and Fan (1996)), implemented fully in C based on quickselect and radixsort algorithms, and also supports an ordering vector as optional input to speed up the process. It is up to 2x faster than stats::quantile
on larger vectors, but also especially fast on smaller data, where the R overhead of stats::quantile
becomes burdensome. For maximum performance during repeated executions, a programmer's version .quantile()
with different defaults is also provided.
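A short sketch comparing fquantile() with stats::quantile() on the built-in mtcars data, followed by a weighted call. Using wt as weights is purely illustrative.

```r
library(collapse)

x <- mtcars$mpg
probs <- c(0.1, 0.5, 0.9)

fq <- fquantile(x, probs)  # default type = 7, as in stats::quantile
bq <- quantile(x, probs)

# Weighted quantiles, here weighting each car by its weight column
wq <- fquantile(x, probs, w = mtcars$wt)

print(fq)
print(wq)
```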
Added function fdist()
: A fast and versatile replacement for stats::dist
. It computes a full euclidean distance matrix around 4x faster than stats::dist
in serial mode, with additional gains possible through multithreading along the distance matrix columns (decreasing thread loads as the matrix is lower triangular). It also supports computing the distance of a matrix with a single row-vector, or simply between two vectors. E.g. fdist(mat, mat[1, ])
is the same as sqrt(colSums((t(mat) - mat[1, ])^2))
, but about 20x faster in serial mode, and fdist(x, y)
is the same as sqrt(sum((x-y)^2))
, about 3x faster in serial mode. In both cases (sub-column level) multithreading is available. Note that fdist
does not skip missing values i.e. NA
's will result in NA
distances. There is also no internal implementation for integers or data frames. Such inputs will be coerced to numeric matrices.
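The two equivalences stated above can be verified on small random inputs. This is only a sketch; the matrix and vectors are simulated for illustration.

```r
library(collapse)

set.seed(1)
mat <- matrix(rnorm(20), nrow = 5, ncol = 4)

# Distance of every row of mat to its first row
d_row   <- fdist(mat, mat[1, ])
ref_row <- sqrt(colSums((t(mat) - mat[1, ])^2))

# Distance between two plain vectors
x <- rnorm(10)
y <- rnorm(10)
d_xy   <- fdist(x, y)
ref_xy <- sqrt(sum((x - y)^2))

print(all.equal(as.numeric(d_row), ref_row))
print(all.equal(as.numeric(d_xy), ref_xy))
```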
Added function GRPid()
to easily fetch the group id from a grouping object, especially inside grouped fmutate()
calls. This addition was warranted especially by the new improved fnth.default()
method which allows orderings to be supplied for performance improvements. See comments on fnth()
and the example provided below.
fsummarize()
was added as a synonym to fsummarise
. Thanks @arthurgailes for the PR.
C API: collapse exports around 40 C functions that provide functionality that is either convenient or rather complicated to implement from scratch. The exported functions can be found at the bottom of src/ExportSymbols.c
. The API does not include the Fast Statistical Functions, which I thought are too closely related to how collapse works internally to be of much use to a C programmer (e.g. they expect grouping objects or certain kinds of integer vectors). But you are free to request the export of additional functions, including C++ functions.
fnth()
and fmedian()
were rewritten in C, with significant gains in performance and versatility. Notably, fnth()
now supports (grouped, weighted) continuous quantile estimation like fquantile()
(fmedian()
, which is a wrapper around fnth()
, can also estimate various quantile based weighted medians). The new default for fnth()
is ties = "q7"
, which gives the same result as (f)quantile(..., type = 7)
(R's default). OpenMP multithreading across groups is also much more effective in both the weighted and unweighted case. Finally, fnth.default
gained an additional argument o
to pass an ordering vector, which can dramatically speed up repeated invocations of the function on the same data:
# Estimating multiple weighted-grouped quantiles on mpg: pre-computing an ordering provides extra speed.
mtcars %>% fgroup_by(cyl, vs, am) %>%
  fmutate(o = radixorder(GRPid(), mpg)) %>% # On grouped data, need to account for GRPid()
  fsummarise(mpg_Q1 = fnth(mpg, 0.25, o = o, w = wt),
             mpg_median = fmedian(mpg, o = o, w = wt),
             mpg_Q3 = fnth(mpg, 0.75, o = o, w = wt))
# Note that without weights this is not always faster. Quickselect can be very efficient, so it depends
# on the data, the number of groups, whether they are sorted (which speeds up radixorder), etc...
BY
now supports data-length arguments to be passed e.g. BY(mtcars, mtcars$cyl, fquantile, w = mtcars$wt)
, making it effectively a generic grouped mapply
function as well. Furthermore, the grouped_df method now also expands grouping columns for output length > 1.
collap()
, which internally uses BY
with non-Fast Statistical Functions, now also supports arbitrary further arguments passed down to functions to be split by groups. Thus users can also apply custom weighted functions with collap()
. Furthermore, the parsing of the FUN
, catFUN
and wFUN
arguments was improved and brought in-line with the parsing of .fns
in across()
. The main benefit of this is that Fast Statistical Functions are now also detected and optimizations carried out when passed in a list providing a new name e.g. collap(data, ~ id, list(mean = fmean))
is now optimized! Thanks @ttrodrigz (#358) for requesting this.
descr()
, by virtue of fquantile
and the improvements to BY
, supports full-blown grouped and weighted descriptions of data. This is implemented through additional by
and w
arguments. The function has also been turned into an S3 generic, with a default and a 'grouped_df' method. The 'descr' methods as.data.frame
and print
also feature various improvements, and a new compact
argument to print.descr
, allowing a more compact printout. Users will also notice improved performance, mainly due to fquantile
: on the M1 descr(wlddev)
is now 2x faster than summary(wlddev)
, and 41x faster than Hmisc::describe(wlddev)
. Thanks @statzhero for the request (#355).
radixorder
is about 25% faster on characters and doubles. This also benefits grouping performance. Note that group()
may still be substantially faster on unsorted data, so if performance is critical try the sort = FALSE
argument to functions like fgroup_by
and compare.
Most list processing functions are noticeably faster, as checking the data types of elements in a list is now also done in C, and I have made some improvements to collapse's version of rbindlist()
(used in unlist2d()
, and various other places).
fsummarise
and fmutate
gained an ability to evaluate arbitrary expressions that result in lists / data frames without the need to use across()
. For example: mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mctl(cor(cbind(mpg, wt, carb)), names = TRUE))
or mtcars |> fgroup_by(cyl) |> fsummarise(mctl(lmtest::coeftest(lm(mpg ~ wt + carb)), names = TRUE))
. There is also the possibility to compute expressions using .data
e.g. mtcars |> fgroup_by(cyl) |> fsummarise(mctl(lmtest::coeftest(lm(mpg ~ wt + carb, .data)), names = TRUE))
yields the same thing, but is less efficient because the whole dataset (including 'cyl') is split by groups. For greater efficiency and convenience, you can pre-select columns using a global .cols
argument, e.g. mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mctl(cor(.data), names = TRUE), .cols = .c(mpg, wt, carb))
gives the same as above. Three Notes about this:
... fmutate, have the same length as the data (in each group).
If .data is used, the entire expression (expr) will be turned into a function of .data (function(.data) expr), which means columns are only available when accessed through .data e.g. .data$col1.
fsummarise
supports computations with mixed result lengths e.g. mtcars |> fgroup_by(cyl) |> fsummarise(N = GRPN(), mean_mpg = fmean(mpg), quantile_mpg = fquantile(mpg))
, as long as all computations result in either length 1 or length k vectors, where k is the maximum result length (e.g. for fquantile
with default settings k = 5).
List extraction function get_elem()
now has an option invert = TRUE
(default FALSE
) to remove matching elements from a (nested) list. Also the functionality of argument keep.class = TRUE
is implemented in a better way, such that the default keep.class = FALSE
toggles classes from (non-matched) list-like objects inside the list to be removed.
num_vars()
has become a bit smarter: columns of class 'ts' and 'units' are now also recognized as numeric. In general, users should be aware that num_vars()
does not regard any R methods defined for is.numeric()
, it is implemented in C and simply checks whether objects are of type integer or double, and do not have a class. The addition of these two exceptions now guards against two common cases where num_vars()
may give undesirable outcomes. Note that num_vars()
is also called in collap()
to distinguish between numeric (FUN
) and non-numeric (catFUN
) columns.
Improvements to setv()
and copyv()
, making them more robust to borderline cases: integer(0)
passed to v
does nothing (instead of error), and it is also possible to pass a single real index if vind1 = TRUE
i.e. passing 1
instead of 1L
does not produce an error.
alloc()
now works with all types of objects i.e. it can replicate any object. If the input is non-atomic, atomic with length > 1 or NULL
, the output is a list of these objects, e.g. alloc(NULL, 10)
gives a length 10 list of NULL
objects, or alloc(mtcars, 10)
gives a list of mtcars
datasets. Note that in the latter case the datasets are not deep-copied, so no additional memory is consumed.
missing_cases()
and na_omit()
have gained an argument prop = 0
, indicating the proportion of values missing for the case to be considered missing/to be omitted. The default value of 0
indicates that at least 1 value must be missing. Of course setting prop = 1
indicates that all values must be missing. For data frames/lists the checking is done efficiently in C. For matrices this is currently still implemented using rowSums(is.na(X)) >= max(as.integer(prop * ncol(X)), 1L)
, so the performance is less than optimal.
missing_cases()
has an extra argument count = FALSE
. Setting count = TRUE
returns the case-wise missing value count (by cols
).
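A small sketch of the prop and count arguments on a made-up data frame in which one row is partly missing and one row is completely missing.

```r
library(collapse)

df <- data.frame(a = c(1, NA, NA), b = c(1, 2, NA))

any_missing <- missing_cases(df)                # default prop = 0: at least one NA
all_missing <- missing_cases(df, prop = 1)      # only rows that are entirely NA
n_missing   <- missing_cases(df, count = TRUE)  # number of NAs per row

kept_default <- na_omit(df)                     # drops rows 2 and 3
kept_prop1   <- na_omit(df, prop = 1)           # drops only row 3

print(n_missing)
print(kept_prop1)
```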
Functions frename()
and setrename()
have an additional argument .nse = TRUE
, conforming to the default non-standard evaluation of tagged vector expressions e.g. frename(mtcars, mpg = newname)
is the same as frename(mtcars, mpg = "newname")
. Setting .nse = FALSE
allows newname
to be a variable holding a name e.g. newname = "othername"; frename(mtcars, mpg = newname, .nse = FALSE)
. Another use of the argument is that a (named) character vector can now be passed to the function to rename a (subset of) columns e.g. cvec = letters[1:3]; frename(mtcars, cvec, cols = 4:6, .nse = FALSE)
(this works even with .nse = TRUE
), and names(cvec) = c("cyl", "vs", "am"); frename(mtcars, cvec, .nse = FALSE)
. Furthermore, setrename()
now also returns the renamed data invisibly, and relabel()
and setrelabel()
have also gained similar flexibility to allow (named) lists or vectors of variable labels to be passed. Note that these functions have no NSE capabilities, so they work essentially like frename(..., .nse = FALSE)
.
Function add_vars()
became a bit more flexible and also allows single vectors to be added with tags e.g. add_vars(mtcars, log_mpg = log(mtcars$mpg), STD(mtcars))
, similar to cbind
. However add_vars()
continues to not replicate length 1 inputs.
Safer multithreading: OpenMP multithreading over parts of the R API is minimized, reducing errors that occurred especially when multithreading across data frame columns. Also the number of threads supplied by the user to all OpenMP enabled functions is ensured to not exceed either of omp_get_num_procs()
, omp_get_thread_limit()
, and omp_get_max_threads()
.
Fixed some warnings on rchk and newer C compilers (LLVM clang 10+).
.pseries
/ .indexed_series
methods also change the implicit class of the vector (attached after "pseries"
), if the data type changed. e.g. calling a function like fgrowth
on an integer pseries changed the data type to double, but the "integer" class was still attached after "pseries".
Fixed bad testing for SE inputs in fgroup_by()
and findex_by()
. See #320.
Added rsplit.matrix
method.
descr()
now by default also reports 10% and 90% quantiles for numeric variables (in line with STATA's detailed summary statistics), and can also be applied to 'pseries' / 'indexed_series'. Furthermore, descr()
itself now has an argument stepwise
such that descr(big_data, stepwise = TRUE)
yields computation of summary statistics on a variable-by-variable basis (and the finished 'descr' object is returned invisibly). The printed result is thus identical to print(descr(big_data), stepwise = TRUE)
, with the difference that the latter first does the entire computation whereas the former computes statistics on demand.
Function ss()
has a new argument check = TRUE
. Setting check = FALSE
allows subsetting data frames / lists with positive integers without checking whether integers are positive or in-range. For programmers.
Function get_vars()
has a new argument rename
allowing select-renaming of columns in standard evaluation programming, e.g. get_vars(mtcars, c(newname = "cyl", "vs", "am"), rename = TRUE)
. The default is rename = FALSE
, to warrant full backwards compatibility. See #327.
Added helper function setattrib()
, to set a new attribute list for an object by reference + invisible return. This is different from the existing function setAttrib()
(note the capital A), which takes a shallow copy of list-like objects and returns the result.
flm
and fFtest
are now internal generic with an added formula method e.g. flm(mpg ~ hp + carb, mtcars, weights = wt)
or fFtest(mpg ~ hp + carb | vs + am, mtcars, weights = wt)
in addition to the programming interface. Thanks to Grant McDermott for suggesting.
Added method as.data.frame.qsu
, to efficiently turn the default array outputs from qsu()
into tidy data frames.
Major improvements to setv
and copyv
, generalizing the scope of operations that can be performed to all common cases. This means that even simple base R operations such as X[v] <- R
can now be done significantly faster using setv(X, v, R)
.
n
and qtab
can now be added to options("collapse_mask")
e.g. options(collapse_mask = c("manip", "helper", "n", "qtab"))
. This will export a function n()
to get the (group) count in fsummarise
and fmutate
(which can also always be done using GRPN()
but n()
is more familiar to dplyr users), and will mask table()
with qtab()
, which is principally a fast drop-in replacement, but with some different further arguments.
Added C-level helper function all_funs
, which fetches all the functions called in an expression, similar to setdiff(all.names(x), all.vars(x))
but better because it takes account of the syntax. For example let x = quote(sum(sum))
i.e. we are summing a column named sum
. Then all.names(x) = c("sum", "sum")
and all.vars(x) = "sum"
so that the difference is character(0)
, whereas all_funs(x)
returns "sum"
. This function makes collapse smarter when parsing expressions in fsummarise
and fmutate
and deciding which ones to vectorize.
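The example above can be reproduced as follows. This is a small sketch assuming collapse is installed; the base R calls show why the name-based difference comes out empty.

```r
library(collapse)

# An expression that sums a column which is itself named "sum"
x <- quote(sum(sum))

nms  <- all.names(x)   # every symbol in the expression, functions included
vars <- all.vars(x)    # symbols used as variables only

# The set difference cannot tell the function apart from the column
name_diff <- setdiff(nms, vars)

# all_funs() inspects the call structure instead of comparing names
funs <- all_funs(x)
```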
sort.row
(replaced by sort
in 2020) is now removed from collap
. Also arguments return.order
and method
were added to collap
providing full control of the grouping that happens internally.
Tests needed to be adjusted for the upcoming release of dplyr 1.0.8 which involves an API change in mutate
. fmutate
will not take over these changes i.e. fmutate(..., .keep = "none")
will continue to work like dplyr::transmute
. Furthermore, no more tests involving dplyr are run on CRAN, and I will also not follow along with any future dplyr API changes.
The C-API macro installTrChar
(used in the new massign
function) was replaced with installChar
to maintain backwards compatibility with R versions prior to 3.6.0. Thanks @tedmoorman #213.
Minor improvements to group()
, providing increased performance for doubles and also increased performance when the second grouping variable is integer, which turned out to be very slow in some instances.
Removed tests involving the weights package (which is not available on R-devel CRAN checks).
fgroup_by
is more flexible, supporting computing columns e.g. fgroup_by(GGDC10S, Variable, Decade = floor(Year / 10) * 10)
and various programming options e.g. fgroup_by(GGDC10S, 1:3)
, fgroup_by(GGDC10S, c("Variable", "Country"))
, or fgroup_by(GGDC10S, is.character)
. You can also use column sequences e.g. fgroup_by(GGDC10S, Country:Variable, Year)
, but this should not be mixed with computing columns. Compute expressions may also not include the :
function.
More memory efficient attribute handling in C/C++ (using C-API macro SHALLOW_DUPLICATE_ATTRIB
instead of DUPLICATE_ATTRIB
) in most places.
order
instead of sort
in function GRP
(from a very early version of collapse), is now disabled.
fvar
, fsd
, fscale
and qsu
) to calculate variances, occurring when initial or final zero weights caused the running sum of weights in the algorithm to be zero, yielding a division by zero and NA
as output although a value was expected. These functions now skip zero weights alongside missing weights, which also implies that you can pass a logical vector to the weights argument to very efficiently calculate statistics on a subset of data (e.g. using qsu
).
Function group
was added, providing a low-level interface to a new unordered grouping algorithm based on hashing in C and optimized for R's data structures. The algorithm was heavily inspired by the great kit
package of Morgan Jacob, and now feeds into the package through multiple central functions (including GRP
/ fgroup_by
, funique
and qF
) when invoked with argument sort = FALSE
. It is also used in internal groupings performed in data transformation functions such as fwithin
(when no factor or 'GRP' object is provided to the g
argument). The speed of the algorithm is very promising (often superior to radixorder
), and it could be used in more places still. I welcome any feedback on its performance on different datasets.
Function gsplit
provides an efficient alternative to split
based on grouping objects. It is used as a new backend to rsplit
(which also supports data frame) as well as BY
, collap
, fsummarise
and fmutate
- for more efficient grouped operations with functions external to the package.
Added multiple functions to facilitate memory efficient programming (written in C). These include elementary mathematical operations by reference (setop
, %+=%
, %-=%
, %*=%
, %/=%
), supporting computations involving integers and doubles on vectors, matrices and data frames (including row-wise operations via setop
) with no copies at all. Furthermore a set of functions which check a single value against a vector without generating logical vectors: whichv
, whichNA
(operators %==%
and %!=%
which return indices and are significantly faster than ==
, especially inside functions like fsubset
), anyv
and allv
(allNA
was already added before). Finally, functions setv
and copyv
speed up operations involving the replacement of a value (x[x == 5] <- 6
) or of a sequence of values from an equally sized object (x[x == 5] <- y[x == 5]
, or x[ind] <- y[ind]
where ind
could be pre-computed vectors or indices) in vectors and data frames without generating any logical vectors or materializing vector subsets.
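The value-checking functions and the by-reference arithmetic described above can be tried out as follows. This is a sketch on small made-up vectors, assuming collapse is installed.

```r
library(collapse)

x <- c(5, 2, 5, 7)

# Indices of elements equal to a single value, without building x == 5
idx     <- whichv(x, 5)                  # same result as which(x == 5)
idx_op  <- x %==% 5                      # operator form, also returns indices
idx_not <- whichv(x, 5, invert = TRUE)   # same result as which(x != 5)

# Single-value checks that avoid an intermediate logical vector
has_five <- anyv(x, 5)   # like any(x == 5)
all_five <- allv(x, 5)   # like all(x == 5)

# Arithmetic by reference: v is updated in place, no copy is made
v <- c(1, 2, 3)
v %+=% 10
```

The operators return indices rather than logical vectors, so they are not drop-in replacements for `==` wherever a logical mask is expected.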
Function vlengths
was added as a more efficient alternative to lengths
(without method dispatch, simply coded in C).
Function massign
provides a multivariate version of assign
(written in C, and supporting all basic vector types). In addition the operator %=%
was added as an efficient multiple assignment operator. (It is called %=%
and not %<-%
to facilitate the translation of Matlab or Python codes into R, and because the zeallot package already provides multiple-assignment operators (%<-%
and %->%
), which are significantly more versatile, but orders of magnitude slower than %=%
.)
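A short sketch of multiple assignment, assuming collapse is installed. The variable names `a`, `b`, `p` and `q` are made up for illustration.

```r
library(collapse)

# Operator form: names on the left, a list of values on the right
c("a", "b") %=% list(1:3, "text")

# Function form, assigning into the calling environment by default
massign(c("p", "q"), list(TRUE, 2.5))
```

After these calls `a`, `b`, `p` and `q` exist as ordinary variables in the calling environment.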
Fully fledged fmutate
function that provides functionality analogous to dplyr::mutate
(sequential evaluation of arguments, including arbitrary tagged expressions and across
statements). fmutate
is optimized to work together with the packages Fast Statistical and Data Transformation Functions, yielding fast, vectorized execution, but also benefits from gsplit
for other operations.
across()
function implemented for use inside fsummarise
and fmutate
. It is also optimized for Fast Statistical and Data Transformation Functions, but performs well with other functions too. It has an additional argument .apply = FALSE
which will apply functions to the entire subset of the data instead of individual columns, and thus allows for nesting tibbles and estimating models or correlation matrices by groups etc. across()
also supports an arbitrary number of additional arguments which are split and evaluated by groups if necessary. Multiple across()
statements can be combined with tagged vector expressions in a single call to fsummarise
or fmutate
. Thus the computational framework is pretty general and similar to data.table, although less efficient with big datasets.
Added functions relabel
and setrelabel
to make interactive dealing with variable labels a bit easier. Note that both functions operate by reference. (Through vlabels<-
which is implemented in C. Taking a shallow copy of the data frame is useless in this case because variable labels are attributes of the columns, not of the frame). The only difference between the two is that setrelabel
returns the result invisibly.
Function shortcuts rnm
and mtt
added for frename
and fmutate
. across
can also be abbreviated using acr
.
Added two options that can be invoked before loading of the package to change the namespace: options(collapse_mask = c(...))
can be set to export copies of selected (or all) functions in the package that start with f
removing the leading f
e.g. fsubset
-> subset
(both fsubset
and subset
will be exported). This allows masking base R and dplyr functions (even basic functions such as sum
, mean
, unique
etc. if desired) with collapse's fast functions, facilitating the optimization of existing codes and allowing you to work with collapse using a more natural namespace. The package has been internally insulated against such changes, but of course they might have major effects on existing codes. Also options(collapse_F_to_FALSE = FALSE)
can be invoked to get rid of the lead operator F
, which masks base::F
(an issue raised by some people who like to use T
/F
instead of TRUE
/FALSE
). Read the help page ?collapse-options
for more information.
Package loads faster (because I don't fetch functions from some other C/C++ heavy packages in .onLoad
anymore, which implied unnecessary loading of a lot of DLLs).
fsummarise
is now also fully featured supporting evaluation of arbitrary expressions and across()
statements. Note that mixing Fast Statistical Functions with other functions in a single expression can yield unintended outcomes, read more at ?fsummarise
.
funique
benefits from group
in the default sort = FALSE
configuration, providing extra speed and unique values in first-appearance order in both the default and the data frame method, for all data types.
Function ss
supports both empty i
or j
.
The printout of fgroup_by
also shows minimum and maximum group size for unbalanced groupings.
In ftransformv/settransformv
and fcomputev
, the vars
argument is also evaluated inside the data frame environment, allowing NSE specifications using column names e.g. ftransformv(data, c(col1, col2:coln), FUN)
.
qF
with option sort = FALSE
now generates factors with levels in first-appearance order (instead of a random order assigned by the hash function), and can also be called on an existing factor to recast the levels in first-appearance order. It is also faster with sort = FALSE
(thanks to group
).
finteraction
has argument sort = FALSE
to also take advantage of group
.
rsplit
has improved performance through gsplit
, and an additional argument use.names
, which can be used to return an unnamed list.
Speedup in vtypes
and functions num_vars
, cat_vars
, char_vars
, logi_vars
and fact_vars
. Note that num_vars
behaves slightly differently as discussed above.
vlabels(<-)
/ setLabels
rewritten in C, giving a ~20x speed improvement. Note that they now operate by reference.
vlabels
, vclasses
and vtypes
have a use.names
argument. The default is TRUE
(as before).
colorder
can rename columns on the fly and also has a new mode pos = "after"
to place all selected columns after the first selected one, e.g.: colorder(mtcars, cyl, vs_new = vs, am, pos = "after")
. The pos = "after"
option was also added to roworderv
.
add_stub
and rm_stub
have an additional cols
argument to apply a stub to certain columns only e.g. add_stub(mtcars, "new_", cols = 6:9)
.
namlab
has additional arguments N
and Ndistinct
, allowing to display number of observations and distinct values next to variable names, labels and classes, to get a nice and quick overview of the variables in a large dataset.
copyMostAttrib
only copies the "row.names"
attribute when known to be valid.
na_rm
can now be used to efficiently remove empty or NULL
elements from a list.
flag
, fdiff
and fgrowth
produce fewer messages (i.e. no message if you don't use a time variable in grouped operations, and messages about computations on highly irregular panel data only if data length exceeds 10 million obs.).
The print methods of pwcor
and pwcov
now have a return
argument, allowing users to obtain the formatted correlation matrix, for exporting purposes.
replace_NA
, recode_num
and recode_char
have improved performance and an additional argument set
to take advantage of setv
to change (some) data by reference. For replace_NA
, this feature is mature and setting set = TRUE
will modify all selected columns in place and return the data invisibly. For recode_num
and recode_char
only a part of the transformations are done by reference, thus users will still have to assign the data to preserve changes. In the future, this will be improved so that set = TRUE
toggles all transformations to be done by reference.
The plot method for panel series matrices and arrays plot.psmat
was improved slightly. It now supports custom colours and drawing of a grid.
settransform
and settransformv
can now be called without attaching the package e.g. collapse::settransform(data, ...)
. These errored before when collapse was not loaded because they are simply wrappers around data <- ftransform(data, ...)
. I'd like to note from a discussion that avoiding shallow copies with <-
(e.g. via :=
) does not appear to yield noticeable performance gains. Where data.table is faster on big data this mostly has to do with parallelism and sometimes with algorithms, generally not memory efficiency.
Functions setAttrib
, copyAttrib
and copyMostAttrib
only make a shallow copy of lists, not of atomic vectors (which amounts to doing a full copy and is inefficient). Thus atomic objects are now modified in-place.
Small improvements: Calling qF(x, ordered = FALSE)
on an ordered factor will remove the ordered class, the operators L
, F
, D
, Dlog
, G
, B
, W
, HDB
, HDW
and functions like pwcor
now work on unnamed matrices or data frames.
The first argument of ftransform
was renamed to .data
from X
. This was done to enable the user to transform columns named "X". For the same reason the first argument of frename
was renamed to .x
from x
(not .data
to make it explicit that .x
can be any R object with a "names" attribute). It is not possible to deprecate X
and x
without at the same time undoing the benefits of the argument renaming, thus this change is immediate and code breaking in rare cases where the first argument is explicitly set.
The function is.regular
to check whether an R object is atomic or list-like is deprecated and will be removed before the end of the year. This was done to avoid a namespace clash with the zoo package (#127).
unlist2d
produced a subsetting error if an empty list was present in the list-tree. This is now fixed, empty or NULL
elements in the list-tree are simply ignored (#99).
A function fsummarize
was added to facilitate translating dplyr / data.table code to collapse. Like collap
, it is only very fast when used with the Fast Statistical Functions.
A function t_list
is made available to efficiently transpose lists of lists.
A small patch for 1.5.0 that:
Fixes a numeric precision issue when grouping doubles (e.g. before qF(wlddev$LIFEEX)
gave an error, now it works).
Fixes a minor issue with fhdwithin
when applied to pseries and fill = FALSE
.
collapse 1.5.0, released early January 2021, presents important refinements and some additional functionality.
fhdbetween / fhdwithin
functions for generalized linear projecting / partialling out. To remedy the damage caused by the removal of lfe, I had to rewrite fhdbetween / fhdwithin
to take advantage of the demeaning algorithm provided by fixest, which has some quite different mechanics. Beforehand, I made some significant changes to fixest::demean
itself to make this integration happen. The CRAN deadline was the 18th of December, and I realized too late that I would not make this. A request to CRAN for extension was declined, so collapse got archived on the 19th. I have learned from this experience, and collapse is now sufficiently insulated that it will not be taken off CRAN even if all suggested packages were removed from CRAN.
numeric(0)
are fixed (thanks to @eshom and @acylam, #101). The default behavior is that all collapse functions return numeric(0)
again, except for fnobs
, fndistinct
which return 0L
, and fvar
, fsd
which return NA_real_
.
Functions fhdwithin / HDW
and fhdbetween / HDB
have been reworked, delivering higher performance and greater functionality: For higher-dimensional centering and heterogeneous slopes, the demean
function from the fixest package is imported (conditional on the availability of that package). The linear prediction and partialling out functionality is now built around flm
and also allows for weights and different fitting methods.
In collap
, the default behavior of give.names = "auto"
was altered when used together with the custom
argument. Before the function name was always added to the column names. Now it is only added if a column is aggregated with two different functions. I apologize if this breaks any code dependent on the new names, but this behavior just better reflects most common use (applying only one function per column), as well as STATA's collapse.
For list processing functions like get_elem
, has_elem
etc. the default for the argument DF.as.list
was changed from TRUE
to FALSE
. This means if a nested list contains data frames, these data frames will not be searched for matching elements. This default also reflects the more common usage of these functions (extracting entire data frames or computed quantities from nested lists rather than searching / subsetting lists of data frames). The change also delivers a considerable performance gain.
Added a set of 10 operators %rr%
, %r+%
, %r-%
, %r*%
, %r/%
, %cr%
, %c+%
, %c-%
, %c*%
, %c/%
to facilitate and speed up row- and column-wise arithmetic operations involving a vector and a matrix / data frame / list. For example X %r*% v
efficiently multiplies every row of X
with v
. Note that more advanced functionality is already provided in TRA()
, dapply()
and the Fast Statistical Functions, but these operators are intuitive and very convenient to use in matrix or matrix-style code, or in piped expressions.
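To illustrate the row- and column-wise operators described above, here is a small sketch on a made-up matrix, assuming collapse is installed. The base R `sweep()` calls are included only as a reference for the expected result.

```r
library(collapse)

# A 2 x 3 numeric matrix with rows (1, 3, 5) and (2, 4, 6)
X <- matrix(as.numeric(1:6), nrow = 2)

# Row-wise: every row of X is multiplied elementwise with v (length ncol(X))
v <- c(10, 100, 1000)
by_row <- X %r*% v
by_row_base <- sweep(X, 2, v, "*")

# Column-wise: every column of X is multiplied elementwise with u (length nrow(X))
u <- c(2, 3)
by_col <- X %c*% u
by_col_base <- sweep(X, 1, u, "*")
```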
Added function missing_cases
(opposite of complete.cases
and faster for data frames / lists).
Added function allNA
for atomic vectors.
New vignette about using collapse together with data.table, available online.
flag / L / F
, fdiff / D / Dlog
and fgrowth / G
now natively support irregular time series and panels, and feature a 'complete approach' i.e. values are shifted around taking full account of the underlying time-dimension!
Functions pwcor
and pwcov
can now compute weighted correlations on the pairwise or complete observations, supported by C-code that is (conditionally) imported from the weights package.
fFtest
now also supports weights.
collap
now provides an easy workaround to aggregate some columns using weights and others without. The user may simply append the names of Fast Statistical Functions with _uw
to disable weights. Example: collapse::collap(mtcars, ~ cyl, custom = list(fmean_uw = 3:4, fmean = 8:10), w = ~ wt)
aggregates columns 3 through 4 using a simple mean and columns 8 through 10 using the weighted mean.
The parallelism in collap
using parallel::mclapply
has been reworked to operate at the column-level, and not at the function level as before. It is still not available for Windows though. The default number of cores was set to mc.cores = 2L
, which now gives an error on Windows if parallel = TRUE
.
Function recode_char
now has additional options ignore.case
and fixed
(passed to grepl
), for enhanced recoding of character data based on regular expressions.
rapply2d
now has a classes
argument permitting more flexible use.
na_rm
and some other internal functions were rewritten in C. na_rm
is now 2x faster than x[!is.na(x)]
with missing values and 10x faster without missing values.
An improvement to the [.GRP_df
method enabling the use of most data.table methods (such as :=
) on a grouped data.table created with fgroup_by
.
Some documentation updates by Kevin Tappe.
collapse 1.4.1 is a small patch for 1.4.0 that:
fixes clang-UBSAN and rchk issues in 1.4.0 (minor bugs in compiled code resulting, in this case, from trying to coerce a NaN
value to integer, and failing to protect a shallow copy of a variable).
Adds a method [.GRP_df
that allows robust subsetting of grouped objects created with fgroup_by
(thanks to Patrice Kiener for flagging this).
collapse 1.4.0, released early November 2020, presents some important refinements, particularly in the domain of attribute handling, as well as some additional functionality. The changes make collapse smarter, more broadly compatible and more secure, and should not break existing code.
Deep Matrix Dispatch / Extended Time Series Support: The default methods of all statistical and transformation functions dispatch to the matrix method if is.matrix(x) && !inherits(x, "matrix")
evaluates to TRUE
. This specification avoids invoking the default method on classed matrix-based objects (such as multivariate time series of the xts / zoo class) not inheriting a 'matrix' class, while still allowing the user to manually call the default method on matrices (objects with implicit or explicit 'matrix' class). The change implies that collapse's generic statistical functions are now well suited to transform xts / zoo and many other time series and matrix-based classes.
Fully Non-Destructive Piped Workflow: fgroup_by(x, ...)
now only adds a class grouped_df, not classes table_df, tbl, grouped_df, and preserves all classes of x
. This implies that workflows such as x %>% fgroup_by(...) %>% fmean
etc. yield an object xAG
of the same class and attributes as x
, not a tibble as before. collapse aims to be as broadly compatible, class-agnostic and attribute preserving as possible.
qDF
, qDT
and qM
now have additional arguments keep.attr
and class
providing precise user control over object conversions in terms of classes and other attributes assigned / maintained. The default (keep.attr = FALSE
) yields hard conversions removing all but essential attributes from the object. E.g. before qM(EuStockMarkets)
would just have returned EuStockMarkets
(because is.matrix(EuStockMarkets)
is TRUE
) whereas now the time series class and 'tsp' attribute are removed. qM(EuStockMarkets, keep.attr = TRUE)
returns EuStockMarkets
as before.
Smarter Attribute Handling: Drawing on the guidance given in the R Internals manual, the following standards for optimal non-destructive attribute handling are formalized and communicated to the user:
The default and matrix methods of the Fast Statistical Functions preserve attributes of the input in grouped aggregations ('names', 'dim' and 'dimnames' are suitably modified). If inputs are classed objects (e.g. factors, time series, checked by is.object
), the class and other attributes are dropped. Simple (non-grouped) aggregations of vectors and matrices do not preserve attributes, unless drop = FALSE
in the matrix method. An exemption is made in the default methods of functions ffirst
, flast
and fmode
, which always preserve the attributes (as the input could well be a factor or date variable).
The data frame methods are unaltered: All attributes of the data frame and columns in the data frame are preserved unless the computation result from each column is a scalar (not computing by groups) and drop = TRUE
(the default).
Transformations with functions like flag
, fwithin
, fscale
etc. are also unaltered: All attributes of the input are preserved in the output (regardless of whether the input is a vector, matrix, data.frame or related classed object). The same holds for transformation options modifying the input ("-", "-+", "/", "+", "*", "%%", "-%%") when using TRA()
function or the TRA = "..."
argument to the Fast Statistical Functions.
For TRA
'replace' and 'replace_fill' options, the data type of the STATS is preserved, not of x. This provides better results particularly with functions like fnobs
and fndistinct
. E.g. previously fnobs(letters, TRA = "replace")
would have returned the observation counts coerced to character, because letters
is character. Now the result is integer typed. For attribute handling this means that the attributes of x are preserved unless x is a classed object and the data types of x and STATS do not match. An exemption to this rule is made if x is a factor and an integer (non-factor) replacement is offered to STATS. In that case the attributes of x are copied exempting the 'class' and 'levels' attribute, e.g. so that fnobs(iris$Species, TRA = "replace")
gives an integer vector, not a (malformed) factor. In the unlikely event that STATS is a classed object, the attributes of STATS are preserved and the attributes of x discarded.
fhdwithin
/ fhdbetween
can only perform higher-dimensional centering if lfe is available. Linear prediction and centering with a single factor (among a list of covariates) is still possible without installing lfe. This change means that collapse now only depends on base R and Rcpp and is supported down to R version 2.10.
Added function rsplit
for efficient (recursive) splitting of vectors and data frames.
Added function fdroplevels
for very fast missing level removal + added argument drop
to qF
and GRP.factor
, the default is drop = FALSE
. The addition of fdroplevels
also enhances the speed of the fFtest
function.
fgrowth
supports annualizing / compounding growth rates through added power
argument.
A function flm
was added for bare bones (weighted) linear regression fitting using different efficient methods: 4 from base R (.lm.fit
, solve
, qr
, chol
), using fastLm
from RcppArmadillo (if installed), or fastLm
from RcppEigen (if installed).
Added function qTBL
to quickly convert R objects to tibble.
Helpers setAttrib
, copyAttrib
and copyMostAttrib
exported for fast attribute handling in R (similar to attributes<-()
, these functions return a shallow copy of the first argument with the set of attributes replaced, but do not perform checks for attribute validity like attributes<-()
. This can yield large performance gains with big objects).
Helper cinv
added wrapping the expression chol2inv(chol(x))
(efficient inverse of a symmetric, positive definite matrix via Choleski factorization).
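The wrapped expression can be checked in base R alone. The sketch below builds a small symmetric, positive definite matrix (made up for illustration) and compares the Cholesky-based inverse with `solve()`. According to the entry above, `cinv(A)` would wrap the same `chol2inv(chol(A))` call.

```r
# Build a symmetric, positive definite matrix as M'M with M non-singular
M <- matrix(c(2, 1, 0,
              1, 3, 1,
              0, 1, 4), nrow = 3)
A <- crossprod(M)

# Inverse via the Cholesky factor: chol() gives upper-triangular R with A = R'R,
# and chol2inv() computes the inverse of A from R
A_inv <- chol2inv(chol(A))

# General-purpose inverse for comparison
A_inv_solve <- solve(A)
```

The Cholesky route only applies to symmetric, positive definite input; `chol()` raises an error otherwise.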
A shortcut gby
is now available to abbreviate the frequently used fgroup_by
function.
A print method for grouped data frames of any class was added.
funique
, fmode
and fndistinct
.
The grouped_df methods for flag
, fdiff
, fgrowth
now also support multiple time variables to identify a panel e.g. data %>% fgroup_by(region, person_id) %>% flag(1:2, list(month, day))
.
More security features for fsubset.data.frame
/ ss
, ss
is now internal generic and also supports subsetting matrices.
In some functions (like na_omit
), passing double values (e.g. 1
instead of integer 1L
) or negative indices to the cols
argument produced an error or unexpected behavior. This is now fixed in all functions.
Fixed a bug in helper function all_obj_equal
occurring if objects are not all equal.
Some performance improvements through increased use of pointers and C API functions.
collapse 1.3.2, released mid September 2020:
Fixed a small bug in fndistinct
for grouped distinct value counts on logical vectors.
Additional security for ftransform
, which now efficiently checks the names of the data and replacement arguments for uniqueness, and also allows computing and transforming list-columns.
Added function ftransformv
to facilitate transforming selected columns with a function - a very efficient replacement for dplyr::mutate_if
and dplyr::mutate_at
.
frename
now allows additional arguments to be passed to a renaming function.
collapse 1.3.1, released end of August 2020, is a patch for v1.3.0 that takes care of some unit test failures on certain operating systems (mostly because of numeric precision issues). It provides no changes to the code or functionality.
collapse 1.3.0, released mid August 2020:
dapply
and BY
now drop all unnecessary attributes if return = "matrix"
or return = "data.frame"
are explicitly requested (the default return = "same"
still seeks to preserve the input data structure).
unlist2d
now saves integer rownames if row.names = TRUE
and a list of matrices without rownames is passed, and id.factor = TRUE
generates a normal factor not an ordered factor. It is however possible to write id.factor = "ordered"
to get an ordered factor id.
fdiff
argument logdiff
renamed to log
, and taking logs is now done in R (reduces size of C++ code and does not generate as many NaN's). logdiff
may still be used, but it may be deactivated in the future. Also in the matrix and data.frame methods for flag
, fdiff
and fgrowth
, columns are only stub-renamed if more than one lag/difference/growth rate is computed.
Added fnth
for fast (grouped, weighted) n'th element/quantile computations.
Added roworder(v)
and colorder(v)
for fast row and column reordering.
Added frename
and setrename
for fast and flexible renaming (by reference).
Added function fungroup
, as replacement for dplyr::ungroup
, intended for use with fgroup_by
.
fmedian
now supports weights, computing a decently fast (grouped) weighted median based on radix ordering.
fmode
now has the option to compute min and max mode, the default is still simply the first mode.
fwithin
now supports quasi-demeaning (added argument theta
) and can thus be used to manually estimate random-effects models.
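As a sketch of what quasi-demeaning computes, the example below compares `fwithin()` with a manual calculation that subtracts `theta` times the group mean. The data and the value of `theta` are made up for illustration, and collapse is assumed to be installed.

```r
library(collapse)

x <- c(1, 3, 5, 10, 20)
g <- c("a", "a", "a", "b", "b")
theta <- 0.5

# Quasi-demeaning: subtract only a fraction theta of each group mean
qd <- fwithin(x, g, theta = theta)

# Manual equivalent in base R; ave() returns the group mean for each element
manual <- x - theta * ave(x, g)
```

With `theta = 1` this reduces to ordinary within-group centering. In a random-effects setting `theta` would be estimated from variance components rather than fixed by hand.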
funique
is now generic with a default vector and data.frame method, providing fast unique values and rows of data. The default was changed to sort = FALSE
.
The shortcut gvr
was created for get_vars(..., regex = TRUE)
.
A helper .c
was introduced for non-standard concatenation (i.e. .c(a, b) == c("a", "b")
).
fmode
and fndistinct
have become a bit faster.
fgroup_by
now preserves data.tables.
ftransform
now also supports a data.frame as replacement argument, which automatically replaces matching columns and adds unmatched ones. Also ftransform<-
was created as a more formal replacement method for this feature.
collap
columns selected through cols
argument are returned in the order selected if keep.col.order = FALSE
. Argument sort.row
is deprecated and replaced by argument sort
. In addition the decreasing
and na.last
arguments were added and handed down to GRP.default
.
radixorder
'sorted' attribute is now always attached.
stats::D
which is masked when collapse is attached, is now preserved through methods D.expression
and D.call
.
GRP
option call = FALSE
to omit a call to match.call
-> minor performance improvement.
Several small performance improvements through rewriting some internal helper functions in C and reworking some R code.
Performance improvements for some helper functions, setRownames
/ setColnames
, na_insert
etc.
Increased scope of testing statistical functions. The functionality of the package is now secured by 7700 unit tests covering all central bits and pieces.
collapse 1.2.1, released end of May 2020:
Minor fixes for 1.2.0 issues that prevented correct installation on Mac OS X and a vignette rebuilding error on Solaris.
fmode.grouped_df
with groups and weights now saves the sum of the weights instead of the max (this makes more sense as the max only applies if all elements are unique).
collapse 1.2.0, released mid May 2020:
grouped_df methods for fast statistical functions now always attach the grouping variables to the output in aggregations, unless argument keep.group_vars = FALSE
. (Formerly, grouping variables were only attached if also present in the data. Code that hinges on this feature should be adjusted.)
qF
ordered
argument default was changed to ordered = FALSE
, and the NA
level is only added if na.exclude = FALSE
. Thus qF
now behaves exactly like as.factor
.
Recode
is deprecated in favor of recode_num
and recode_char
, it will be removed soon. Similarly replace_non_finite
was renamed to replace_Inf
.
In mrtl
and mctl
the argument ret
was renamed return
and now takes descriptive character arguments (the previous version was a direct C++ export and unsafe, code written with these functions should be adjusted).
GRP
argument order
is deprecated in favor of argument decreasing
. order
can still be used but will be removed at some point.
flag
where unused factor levels caused a group size error.
Added a suite of functions for fast data manipulation:
fselect
selects variables from a data frame and is equivalent to, but much faster than, dplyr::select
.
fsubset
is a much faster version of base::subset
to subset vectors, matrices and data.frames. The function ss
was also added as a faster alternative to [.data.frame
.
ftransform
is a much faster update of base::transform
, to transform data frames by adding, modifying or deleting columns. The function settransform
does all of that by reference.
fcompute
is equivalent to ftransform
but returns a new data frame containing only the columns computed from an existing one.
na_omit
is a much faster and enhanced version of base::na.omit
.
replace_NA
efficiently replaces missing values in multi-type data.
Added function fgroup_by
as a much faster version of dplyr::group_by
based on collapse grouping. It attaches a 'GRP' object to a data frame, but only works with collapse's fast functions. This allows dplyr like manipulations that are fully collapse based and thus significantly faster, i.e. data %>% fgroup_by(g1,g2) %>% fselect(cola,colb) %>% fmean
. Note that data %>% dplyr::group_by(g1,g2) %>% dplyr::select(cola,colb) %>% fmean
still works, in which case the dplyr 'group' object is converted to 'GRP' as before. However, data %>% fgroup_by(g1,g2) %>% dplyr::summarize(...)
does not work.
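The fully collapse-based pipeline described above could look roughly as follows. This is a sketch on the built-in mtcars data; it uses the native pipe |> (an assumption, requires R >= 4.1) instead of %>% so that no extra package is needed:

```r
library(collapse)

# Group by two variables, keep two value columns, compute grouped means
res <- mtcars |>
  fgroup_by(cyl, vs) |>
  fselect(mpg, hp) |>
  fmean()

res   # one row per (cyl, vs) combination, grouping columns first
```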
Added function varying
to efficiently check the variation of multi-type data over a dimension or within groups.
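As a small illustration, assuming the default vector method with a grouping vector g (the data here are invented):

```r
library(collapse)

g <- c("a", "a", "b", "b")

v_overall <- varying(c(1, 1, 2, 2))      # does the vector vary at all?
v_within  <- varying(c(1, 1, 2, 2), g)   # does it vary within any group?
v_within2 <- varying(c(1, 2, 2, 2), g)   # group "a" contains 1 and 2
```

Here the first series varies overall but is constant within each group, while the second varies within group "a".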
Added function radixorder
, same as base::order(..., method = "radix")
but more accessible and with built-in grouping features.
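A quick check of the stated equivalence on a made-up vector; as.integer() is used only to drop any attributes attached to the result before comparing:

```r
library(collapse)

x <- c(3, 1, 2, 1, 5)

o_collapse <- radixorder(x)
o_base     <- order(x, method = "radix")

# Both give a permutation that sorts x
x[o_collapse]
all(as.integer(o_collapse) == o_base)
```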
Added functions seqid
and groupid
for generalized run-length type id variable generation from grouping and time variables. seqid
in particular strongly facilitates lagging / differencing irregularly spaced panels using flag
, fdiff
etc.
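The following sketch shows the idea on an invented series with a gap in the time variable: seqid starts a new id whenever the integer sequence is interrupted, and that id can then be passed as the grouping argument so that lags do not reach across the gap.

```r
library(collapse)

t <- c(1L, 2L, 3L, 5L, 6L)      # period 4 is missing
y <- c(10, 11, 12, 20, 21)

id <- seqid(t)                  # new id after each break in the sequence
as.integer(id)

# Lag within consecutive runs only, so the lag never spans the gap
y_lag <- flag(y, 1, g = id)
```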
fdiff
now supports quasi-differences, i.e. $x_t - \rho x_{t-1}$, and quasi-log differences, i.e. $\log(x_t) - \rho \log(x_{t-1})$. An arbitrary $\rho$ can be supplied.
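A sketch comparing both variants with a manual base R computation, assuming the rho and log arguments of fdiff; the series and the value of rho are arbitrary:

```r
library(collapse)

x   <- c(100, 104, 110, 108, 115)
rho <- 0.5
n   <- length(x)

qd  <- fdiff(x, rho = rho)               # x_t - rho * x_{t-1}
qld <- fdiff(x, log = TRUE, rho = rho)   # log(x_t) - rho * log(x_{t-1})

# Manual equivalents (the first element has no predecessor and is NA)
qd_manual  <- c(NA, x[-1] - rho * x[-n])
qld_manual <- c(NA, log(x[-1]) - rho * log(x[-n]))
```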
Added a Dlog
operator for faster access to log-differences.
Faster grouping with GRP
and faster factor generation with added radix method + automatic dispatch between hash and radix method. qF
is now ~ 5x faster than as.factor
on character and around 30x faster on numeric data. Also qG
was enhanced.
Further slight speed tweaks here and there.
collap
now provides more control for weighted aggregations with additional arguments w
, keep.w
and wFUN
to aggregate the weights as well. The defaults are keep.w = TRUE
and wFUN = fsum
. A specialty of collap
remains that keep.by
and keep.w
also work for external objects passed, so code of the form collap(data, by, FUN, catFUN, w = data$weights)
will now have an aggregated weights
vector in the first column.
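A minimal weighted aggregation, assuming the formula interface for both by and w; the data frame df and its column names are hypothetical:

```r
library(collapse)

df <- data.frame(g  = c("a", "a", "b"),
                 x  = c(1, 3, 5),
                 wt = c(1, 3, 2))

# Weighted means of x by g; with keep.w = TRUE (the default) the weights
# are aggregated with wFUN = fsum and kept in the output
agg <- collap(df, ~ g, fmean, w = ~ wt)
agg
```

In this toy example the weighted mean of x in group "a" is (1*1 + 3*3) / 4 = 2.5.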
qsu
now also allows weights to be passed as a formula, i.e. qsu(data, by = ~ group, pid = ~ panelid, w = ~ weights)
.
fgrowth
has a scale
argument, the default is scale = 100
which provides growth rates in percentage terms (as before), but this may now be changed.
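For example, on a made-up series growing by 10% per period:

```r
library(collapse)

x <- c(100, 110, 121)

g_pct  <- fgrowth(x)              # default scale = 100: growth in percent
g_frac <- fgrowth(x, scale = 1)   # growth as a fraction
```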
All statistical and transformation functions now have a hidden list method, so they can be applied to unclassed list-objects as well. However, an error is raised in grouped operations with unequal-length columns.
collapse 1.1.0 released early April 2020:
Fixed remaining gcc10, LTO and valgrind issues in C/C++ code, and added some more tests (there are now ~ 5300 tests ensuring that collapse statistical functions perform as expected).
Fixed the issue that supplying an unnamed list to GRP()
, i.e. GRP(list(v1, v2))
would give an error. Unnamed lists are now automatically named 'Group.1', 'Group.2', etc...
Fixed an issue where, when aggregating by a single id in collap()
(i.e. collap(data, ~ id1)
), the id would be coded as a factor in the aggregated data.frame. All variables, including ids, now retain their class and attributes in the aggregated data.
Added weights (w
) argument to fsum
and fprod
.
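A short sketch with invented numbers; the comparison assumes that the weighted sum is sum(x * w) and the weighted product is prod(x * w):

```r
library(collapse)

x <- c(1, 2, 3)
w <- c(1, 0.5, 2)

ws <- fsum(x, w = w)    # weighted sum
wp <- fprod(x, w = w)   # weighted product
```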
Added an argument mean = 0
to fwithin / W
. This allows simple and grouped centering on an arbitrary mean, 0
being the default. For grouped centering mean = "overall.mean"
can be specified, which will center data on the overall mean of the data. The logical argument add.global.mean = TRUE
used to toggle this in collapse 1.0.0 is therefore deprecated.
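The three centering options can be sketched as follows on a small invented panel with two groups:

```r
library(collapse)

x <- c(1, 2, 3, 10, 20, 30)
g <- c(1, 1, 1, 2, 2, 2)

w0 <- fwithin(x, g)                          # group-centered on 0 (default)
w5 <- fwithin(x, g, mean = 5)                # group-centered on 5
wo <- fwithin(x, g, mean = "overall.mean")   # group-centered on mean(x)
```

With mean = "overall.mean", each group mean ends up equal to mean(x), which is 11 here.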
Added arguments mean = 0
(the default) and sd = 1
(the default) to fscale / STD
. These arguments now allow data to be (group) scaled and centered to an arbitrary mean and standard deviation. Setting mean = FALSE
will just scale data while preserving the mean(s). Special options for grouped scaling are mean = "overall.mean"
(same as fwithin / W
), and sd = "within.sd"
, which will scale the data such that the standard deviation of each group is equal to the within-standard deviation (i.e. the standard deviation computed on the group-centered data). Thus, group scaling a panel dataset with mean = "overall.mean"
and sd = "within.sd"
harmonizes the data across all groups in terms of both mean and variance. The fast algorithm for variance calculation toggled with stable.algo = FALSE
was removed from fscale
. Welford's numerically stable algorithm used by default is fast enough for all practical purposes. The fast algorithm is still available for fvar
and fsd
.
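A sketch of these options on invented data with two groups of very different scale:

```r
library(collapse)

x <- c(1, 2, 3, 10, 20, 30)
g <- c(1, 1, 1, 2, 2, 2)

z  <- fscale(x, mean = 10, sd = 2)    # arbitrary target mean and sd
zs <- fscale(x, mean = FALSE)         # scale to sd = 1, keep the mean of x

# Harmonize groups: same mean (overall mean) and same sd (within-sd)
h <- fscale(x, g, mean = "overall.mean", sd = "within.sd")
tapply(h, g, mean)
tapply(h, g, sd)
```

After the grouped call, both groups should share the same mean and the same standard deviation.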
Added the modulus (%%
) and subtract modulus (-%%
) operations to TRA()
.
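As an illustration with a scalar statistic (values chosen arbitrarily), "%%" returns the remainder and "-%%" subtracts it, i.e. rounds down to a multiple of the statistic:

```r
library(collapse)

x <- c(5, 7, 9)

r_mod <- TRA(x, 4, "%%")    # x %% 4
r_sub <- TRA(x, 4, "-%%")   # x - x %% 4
```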
Added the function finteraction
, for fast interactions, and as_character_factor
to coerce a factor, or all factors in a list, to character (analogous to as_numeric_factor
). Also exported the function ckmatch
, for matching with error message showing non-matched elements.
First version of the package featuring only the functions collap
and qsu
based on code shared by Sebastian Krantz on R-devel, February 2019.
Major rework of the package using Rcpp and data.table internals, introduction of fast statistical functions and operators and expansion of the scope of the package to a broad set of data transformation and exploration tasks. Several iterations of enhancing speed of R code. Seamless integration of collapse with dplyr, plm and data.table. CRAN release of collapse 1.0.0 on 19th March 2020.