The first argument of ftransform was renamed from X to .data. This was done to enable the user to transform columns named “X”. For the same reason, the first argument of frename was renamed from x to .x (not .data, to make it explicit that .x can be any R object with a “names” attribute). It is not possible to deprecate X and x without at the same time undoing the benefits of the argument renaming, so this change is immediate and code-breaking in the rare cases where the first argument is explicitly set.
is.regular, which checks whether an R object is atomic or list-like, is deprecated and will be removed before the end of the year. This was done to avoid a namespace clash with the zoo package (#127).
The macro SHALLOW_DUPLICATE_ATTRIB was used to copy column attributes in a data frame. Since this macro does not copy S4 object bits, it caused some problems with S4 object columns such as POSIXct (e.g. computing lags/leads, first and last values on these columns). This is now fixed: all statistical functions (apart from
fsd) now use
DUPLICATE_ATTRIB and thus preserve S4 object columns (#91).
unlist2d produced a subsetting error if an empty list was present in the list-tree. This is now fixed: empty or
NULL elements in the list-tree are simply ignored (#99).
fsummarize was added to facilitate translating dplyr / data.table code to collapse. Like
collap, it is only very fast when used with the Fast Statistical Functions.
t_list is made available to efficiently transpose lists of lists.
A small patch for 1.5.0 that:
Fixes a numeric precision issue when grouping doubles (e.g. previously
qF(wlddev$LIFEEX) gave an error; now it works).
Fixes a minor issue with
fHDwithin when applied to pseries and
fill = FALSE.
collapse 1.5.0, released early January 2021, presents important refinements and some additional functionality.
fHDbetween / fHDwithin functions for generalized linear projecting / partialling out. To remedy the damage caused by the removal of lfe, I had to rewrite
fHDbetween / fHDwithin to take advantage of the demeaning algorithm provided by fixest, which has some quite different mechanics. Beforehand, I made some significant changes to
fixest::demean itself to make this integration happen. The CRAN deadline was the 18th of December, and I realized too late that I would not make this. A request to CRAN for extension was declined, so collapse got archived on the 19th. I have learned from this experience, and collapse is now sufficiently insulated that it will not be taken off CRAN even if all suggested packages were removed from CRAN.
fHDwithin / HDW and
fHDbetween / HDB have been reworked, delivering higher performance and greater functionality: For higher-dimensional centering and heterogeneous slopes, the
demean function from the fixest package is imported (conditional on the availability of that package). The linear prediction and partialling out functionality is now built around
flm and also allows for weights and different fitting methods.
In collap, the default behavior of
give.names = "auto" was altered when used together with the
custom argument. Previously, the function name was always added to the column names. Now it is only added if a column is aggregated with two different functions. I apologize if this breaks any code dependent on the new names, but this behavior better reflects the most common use (applying only one function per column), as well as Stata’s collapse.
For list processing functions like
has_elem etc. the default for the argument
DF.as.list was changed from TRUE to
FALSE. This means that if a nested list contains data frames, these data frames will not be searched for matching elements. This default also reflects the more common usage of these functions (extracting entire data frames or computed quantities from nested lists rather than searching / subsetting lists of data frames). The change also delivers a considerable performance gain.
Added a set of 10 operators
%c/% to facilitate and speed up row- and column-wise arithmetic operations involving a vector and a matrix / data frame / list. For example
X %r*% v efficiently multiplies every row of X with
v. Note that more advanced functionality is already provided in
dapply() and the Fast Statistical Functions, but these operators are intuitive and very convenient to use in matrix or matrix-style code, or in piped expressions.
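To make the row-wise semantics concrete, here is a minimal base R sketch of the computation that X %r*% v is described as performing. It uses sweep() rather than the collapse operator itself, so it only illustrates the result, not the package's implementation or speed; the object names are illustrative.

```r
# Base R illustration of "multiply every row of X with v".
X <- matrix(1:6, nrow = 2)   # 2 x 3 matrix: rows (1, 3, 5) and (2, 4, 6)
v <- c(10, 100, 1000)        # one multiplier per column of X

# Sweeping v over margin 2 multiplies column j by v[j],
# which is the same as multiplying each row element-wise by v.
row_times <- sweep(X, 2, v, `*`)
print(row_times)
```

With collapse attached, the same result would be written as X %r*% v.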
missing_cases (opposite of
complete.cases and faster for data frames / lists).
allNA for atomic vectors.
New vignette about using collapse together with data.table, available online.
flag / L / F,
fdiff / D / Dlog and
fgrowth / G now natively support irregular time series and panels, and feature a ‘complete approach’, i.e. values are shifted around taking full account of the underlying time dimension!
pwcov can now compute weighted correlations on the pairwise or complete observations, supported by C-code that is (conditionally) imported from the weights package.
fFtest now also supports weights.
collap now provides an easy workaround to aggregate some columns using weights and others without. The user may simply append the names of Fast Statistical Functions with
_uw to disable weights. Example:
collapse::collap(mtcars, ~ cyl, custom = list(fmean_uw = 3:4, fmean = 8:10), w = ~ wt) aggregates columns 3 through 4 using a simple mean and columns 8 through 10 using the weighted mean.
The parallelism in
parallel::mclapply has been reworked to operate at the column-level, and not at the function level as before. It is still not available for Windows though. The default number of cores was set to
mc.cores = 2L, which now gives an error on Windows if
parallel = TRUE.
recode_char now has additional options
fixed (passed to
grepl), for enhanced recoding of character data based on regular expressions.
rapply2d now has a
classes argument permitting more flexible use.
na_rm and some other internal functions were rewritten in C.
na_rm is now 2x faster than
x[!is.na(x)] with missing values and 10x faster without missing values.
An improvement to the
[.GRP_df method enabling the use of most data.table methods (such as
:=) on a grouped data.table created with
Some documentation updates by Kevin Tappe.
collapse 1.4.1 is a small patch for 1.4.0 that:
Fixes clang-UBSAN and rchk issues in 1.4.0 (minor bugs in compiled code resulting, in this case, from trying to coerce a
NaN value to integer, and failing to protect a shallow copy of a variable).
Adds a method
[.GRP_df that allows robust subsetting of grouped objects created with
fgroup_by (thanks to Patrice Kiener for flagging this).
collapse 1.4.0, released early November 2020, presents some important refinements, particularly in the domain of attribute handling, as well as some additional functionality. The changes make collapse smarter, more broadly compatible and more secure, and should not break existing code.
Deep Matrix Dispatch / Extended Time Series Support: The default methods of all statistical and transformation functions dispatch to the matrix method if
is.matrix(x) && !inherits(x, "matrix") evaluates to
TRUE. This specification avoids invoking the default method on classed matrix-based objects (such as multivariate time series of the xts / zoo class) not inheriting a ‘matrix’ class, while still allowing the user to manually call the default method on matrices (objects with implicit or explicit ‘matrix’ class). The change implies that collapse’s generic statistical functions are now well suited to transform xts / zoo and many other time series and matrix-based classes.
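The dispatch condition quoted above can be checked in base R. The sketch below wraps it in a hypothetical helper, dispatches_to_matrix (not a collapse function), and applies it to a plain matrix and to a matrix-based object carrying a made-up class "my_series" that does not inherit from "matrix".

```r
# Hypothetical helper reproducing the condition stated in the text.
dispatches_to_matrix <- function(x) is.matrix(x) && !inherits(x, "matrix")

m <- matrix(1:6, nrow = 2)                     # plain matrix (implicit 'matrix' class)
classed <- structure(m, class = "my_series")   # made-up class on top of a matrix

plain_result   <- dispatches_to_matrix(m)        # FALSE: the default method can still be called on it
classed_result <- dispatches_to_matrix(classed)  # TRUE: it would be routed to the matrix method
```

The plain matrix fails the second half of the condition because inherits() sees its implicit ‘matrix’ class, whereas the classed object still has a two-dimensional ‘dim’ attribute but no ‘matrix’ class.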
Fully Non-Destructive Piped Workflow:
fgroup_by(x, ...) now only adds a class grouped_df, not classes table_df, tbl, grouped_df, and preserves all classes of
x. This implies that workflows such as
x %>% fgroup_by(...) %>% fmean etc. yield an object
xAG of the same class and attributes as
x, not a tibble as before. collapse aims to be as broadly compatible, class-agnostic and attribute preserving as possible.
qM now have additional arguments
class providing precise user control over object conversions in terms of classes and other attributes assigned / maintained. The default (
keep.attr = FALSE) yields hard conversions removing all but essential attributes from the object. E.g. before
qM(EuStockMarkets) would just have returned
TRUE) whereas now the time series class and ‘tsp’ attribute are removed.
qM(EuStockMarkets, keep.attr = TRUE) returns
Smarter Attribute Handling: Drawing on the guidance given in the R Internals manual, the following standards for optimal non-destructive attribute handling are formalized and communicated to the user:
The default and matrix methods of the Fast Statistical Functions preserve attributes of the input in grouped aggregations (‘names’, ‘dim’ and ‘dimnames’ are suitably modified). If inputs are classed objects (e.g. factors, time series, checked by
is.object), the class and other attributes are dropped. Simple (non-grouped) aggregations of vectors and matrices do not preserve attributes, unless
drop = FALSE in the matrix method. An exception is made in the default methods of functions
fmode, which always preserve the attributes (as the input could well be a factor or date variable).
The data frame methods are unaltered: All attributes of the data frame and columns in the data frame are preserved unless the computation result from each column is a scalar (not computing by groups) and
drop = TRUE (the default).
Transformations with functions like
fscale etc. are also unaltered: All attributes of the input are preserved in the output (regardless of whether the input is a vector, matrix, data.frame or related classed object). The same holds for transformation options modifying the input (“-”, “-+”, “/”, “+”, “*”, “%%”, “-%%”) when using
the TRA() function or the
TRA = "..." argument to the Fast Statistical Functions.
For the TRA ‘replace’ and ‘replace_fill’ options, the data type of STATS is preserved, not that of x. This provides better results, particularly with functions like
fNdistinct. E.g. previously
fNobs(letters, TRA = "replace") would have returned the observation counts coerced to character, because
letters is character. Now the result is integer-typed. For attribute handling this means that the attributes of x are preserved unless x is a classed object and the data types of x and STATS do not match. An exception to this rule is made if x is a factor and an integer (non-factor) replacement is offered to STATS. In that case the attributes of x are copied, except for the ‘class’ and ‘levels’ attributes, e.g. so that
fNobs(iris$Species, TRA = "replace") gives an integer vector, not a (malformed) factor. In the unlikely event that STATS is a classed object, the attributes of STATS are preserved and the attributes of x discarded.
fHDbetween can only perform higher-dimensional centering if lfe is available. Linear prediction and centering with a single factor (among a list of covariates) is still possible without installing lfe. This change means that collapse now only depends on base R and Rcpp and is supported down to R version 2.10.
rsplit for efficient (recursive) splitting of vectors and data frames.
fdroplevels for very fast missing level removal + added argument
GRP.factor, the default is
drop = FALSE. The addition of
fdroplevels also enhances the speed of the
fgrowth supports annualizing / compounding growth rates through added
flm was added for barebones (weighted) linear regression fitting using different efficient methods: 4 from base R (
fastLm from RcppArmadillo (if installed), or
fastLm from RcppEigen (if installed).
qTBL to quickly convert R objects to tibble.
copyMostAttrib exported for fast attribute handling in R (similar to
attributes<-(), these functions return a shallow copy of the first argument with the set of attributes replaced, but do not perform checks for attribute validity like
attributes<-(). This can yield large performance gains with big objects).
cinv added wrapping the expression
chol2inv(chol(x)) (efficient inverse of a symmetric, positive definite matrix via Choleski factorization).
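As a quick illustration of the wrapped expression, the following base R sketch inverts a symmetric positive definite matrix via chol2inv(chol(x)) and compares the result with solve(). The test matrix is constructed for the example and is not taken from the package.

```r
set.seed(1)
# crossprod() of a full-rank 5 x 4 matrix gives a 4 x 4 symmetric,
# positive definite matrix (full rank holds almost surely for random draws).
A <- crossprod(matrix(rnorm(20), nrow = 5, ncol = 4))

A_inv <- chol2inv(chol(A))   # the expression the text says cinv wraps
same_as_solve <- isTRUE(all.equal(A_inv, solve(A)))
```

The Cholesky route exploits symmetry and positive definiteness, so it should only be applied to matrices known to satisfy both conditions.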
gby is now available to abbreviate the frequently used
A print method for grouped data frames of any class was added.
The grouped_df methods for
fgrowth now also support multiple time variables to identify a panel e.g.
data %>% fgroup_by(region, person_id) %>% flag(1:2, list(month, day)).
More security features for
ss is now internal generic and also supports subsetting matrices.
In some functions (like
na_omit), passing double values (e.g.
1 instead of integer
1L) or negative indices to the
cols argument produced an error or unexpected behavior. This is now fixed in all functions.
Fixed a bug in helper function
all_obj_equal occurring if objects are not all equal.
Some performance improvements through increased use of pointers and C API functions.
collapse 1.3.2, released mid September 2020:
Fixed a small bug in
fNdistinct for grouped distinct value counts on logical vectors.
Additional security for
ftransform, which now efficiently checks the names of the data and replacement arguments for uniqueness, and also allows computing and transforming list-columns.
frename now allows additional arguments to be passed to a renaming function.
collapse 1.3.1, released end of August 2020, is a patch for v1.3.0 that takes care of some unit test failures on certain operating systems (mostly because of numeric precision issues). It provides no changes to the code or functionality.
collapse 1.3.0, released mid August 2020:
BY now drop all unnecessary attributes if
return = "matrix" or
return = "data.frame" are explicitly requested (the default
return = "same" still seeks to preserve the input data structure).
unlist2d now saves integer rownames if
row.names = TRUE and a list of matrices without rownames is passed, and
id.factor = TRUE generates a normal factor not an ordered factor. It is however possible to write
id.factor = "ordered" to get an ordered factor id.
logdiff renamed to
log, and taking logs is now done in R (reduces size of C++ code and does not generate as many NaN’s).
logdiff may still be used, but it may be deactivated in the future. Also in the matrix and data.frame methods for
fgrowth, columns are only stub-renamed if more than one lag/difference/growth rate is computed.
fnth for fast (grouped, weighted) n’th element/quantile computations.
setrename for fast and flexible renaming (by reference).
fungroup, as replacement for
dplyr::ungroup, intended for use with
fmedian now supports weights, computing a decently fast (grouped) weighted median based on radix ordering.
fmode now has the option to compute min and max mode, the default is still simply the first mode.
fwithin now supports quasi-demeaning (added argument
theta) and can thus be used to manually estimate random-effects models.
funique is now generic with a default vector and data.frame method, providing fast unique values and rows of data. The default was changed to
sort = FALSE.
gvr was created for
get_vars(..., regex = TRUE).
.c was introduced for non-standard concatenation (i.e.
.c(a, b) == c("a", "b")).
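For readers unfamiliar with non-standard concatenation, the idea can be sketched in base R. The helper below, nse_c, is a hypothetical stand-in written for this illustration; it is not collapse's .c and may differ from it in edge cases.

```r
# Hypothetical helper: turn unquoted argument names into a character vector.
nse_c <- function(...) {
  # substitute(list(...)) captures the unevaluated call list(a, b);
  # dropping the first element removes the function name `list`.
  args <- as.list(substitute(list(...)))[-1L]
  vapply(args, deparse, character(1L))
}

result <- nse_c(a, b)   # neither a nor b needs to exist as an object
print(result)
```

Because the arguments are never evaluated, this kind of helper is convenient for typing column names interactively without quotes.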
fNdistinct have become a bit faster.
fgroup_by now preserves data.tables.
ftransform now also supports a data.frame as replacement argument, which automatically replaces matching columns and adds unmatched ones. Also
ftransform<- was created as a more formal replacement method for this feature.
collap columns selected through
cols argument are returned in the order selected if
keep.col.order = FALSE. Argument
sort.row is deprecated, and replaced by argument
sort. In addition the
na.last arguments were added and handed down to
radixorder ‘sorted’ attribute is now always attached.
stats::D which is masked when collapse is attached, is now preserved through methods
call = FALSE to omit a call to
match.call -> minor performance improvement.
Several small performance improvements through rewriting some internal helper functions in C and reworking some R code.
Performance improvements for some helper functions,
Increased scope of testing statistical functions. The functionality of the package is now secured by 7700 unit tests covering all central bits and pieces.
collapse 1.2.1, released end of May 2020:
Minor fixes for 1.2.0 issues that prevented correct installation on Mac OS X and a vignette rebuilding error on solaris.
fmode.grouped_df with groups and weights now saves the sum of the weights instead of the max (this makes more sense as the max only applies if all elements are unique).
collapse 1.2.0, released mid May 2020:
grouped_df methods for fast statistical functions now always attach the grouping variables to the output in aggregations, unless argument
keep.group_vars = FALSE. (Formerly, grouping variables were only attached if also present in the data. Code that relies on this behavior should be adjusted.)
ordered argument default was changed to
ordered = FALSE, and the
NA level is only added if
na.exclude = FALSE. Thus
qF now behaves exactly like
Recode is deprecated in favor of
recode_char, it will be removed soon. Similarly
replace_non_finite was renamed to
mctl the argument
ret was renamed
return and now takes descriptive character arguments (the previous version was a direct C++ export and unsafe, code written with these functions should be adjusted).
order is deprecated in favor of argument
order can still be used but will be removed at some point.
Added a suite of functions for fast data manipulation:
fselect selects variables from a data frame and is equivalent but much faster than
fsubset is a much faster version of
base::subset to subset vectors, matrices and data.frames. The function
ss was also added as a faster alternative to
ftransform is a much faster update of
base::transform, to transform data frames by adding, modifying or deleting columns. The function
settransform does all of that by reference.
fcompute is equivalent to
ftransform but returns a new data frame containing only the columns computed from an existing one.
na_omit is a much faster and enhanced version of
replace_NA efficiently replaces missing values in multi-type data.
fgroup_by as a much faster version of
dplyr::group_by based on collapse grouping. It attaches a ‘GRP’ object to a data frame, but only works with collapse’s fast functions. This allows dplyr-like manipulations that are fully collapse-based and thus significantly faster, e.g.
data %>% fgroup_by(g1,g2) %>% fselect(cola,colb) %>% fmean. Note that
data %>% dplyr::group_by(g1,g2) %>% dplyr::select(cola,colb) %>% fmean still works, in which case the dplyr ‘group’ object is converted to ‘GRP’ as before. However
data %>% fgroup_by(g1,g2) %>% dplyr::summarize(...) does not work.
varying to efficiently check the variation of multi-type data over a dimension or within groups.
radixorder, same as
base::order(..., method = "radix") but more accessible and with built-in grouping features.
groupid for generalized run-length type id variable generation from grouping and time variables.
seqid in particular strongly facilitates lagging / differencing irregularly spaced panels using
fdiff now supports quasi-differences, i.e. $x_t - \rho x_{t-1}$, and quasi-log differences, i.e. $\log(x_t) - \rho \log(x_{t-1})$. An arbitrary $\rho$ can be supplied.
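The two formulas can be sketched in a few lines of base R. The function quasi_diff below is a hypothetical helper written for this illustration only; it handles a single regularly spaced series with one lag and does not reproduce fdiff's grouping, panel or multi-lag features.

```r
# Hypothetical helper: first-order quasi-difference x_t - rho * x_{t-1},
# optionally applied to log(x) for the quasi-log difference.
quasi_diff <- function(x, rho = 1, log = FALSE) {
  if (log) x <- log(x)
  n <- length(x)
  # The first observation has no predecessor, so its result is NA.
  c(NA_real_, x[-1L] - rho * x[-n])
}

x <- c(1, 2, 4, 8)
qd  <- quasi_diff(x, rho = 0.5)              # NA 1.5 3.0 6.0
qld <- quasi_diff(x, rho = 0.5, log = TRUE)  # log(x_t) - 0.5 * log(x_{t-1})
```

Setting rho = 1 recovers the ordinary first difference (or log difference).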
Dlog operator for faster access to log-differences.
Faster grouping with
GRP and faster factor generation with added radix method + automatic dispatch between hash and radix method.
qF is now ~ 5x faster than
as.factor on character and around 30x faster on numeric data. Also
qG was enhanced.
Further slight speed tweaks here and there.
collap now provides more control for weighted aggregations with additional arguments
wFUN to aggregate the weights as well. The defaults are
keep.w = TRUE and
wFUN = fsum. A specialty of
collap remains that
keep.w also work for external objects passed, so code of the form
collap(data, by, FUN, catFUN, w = data$weights) will now have an aggregated
weights vector in the first column.
qsu now also allows weights to be passed in formula i.e.
qsu(data, by = ~ group, pid = ~ panelid, w = ~ weights).
fgrowth has a
scale argument, the default is
scale = 100 which provides growth rates in percentage terms (as before), but this may now be changed.
All statistical and transformation functions now have a hidden list method, so they can be applied to unclassed list-objects as well. An error is however provided in grouped operations with unequal-length columns.
collapse 1.1.0, released early April 2020:
Fixed remaining gcc10, LTO and valgrind issues in C/C++ code, and added some more tests (there are now ~ 5300 tests ensuring that collapse statistical functions perform as expected).
Fixed an issue where aggregating by a single id in
collap(data, ~ id1)), the id would be coded as factor in the aggregated data.frame. All variables including id’s now retain their class and attributes in the aggregated data.
Added weights (
w) argument to
Added an argument
mean = 0 to
fwithin / W. This allows simple and grouped centering on an arbitrary mean,
0 being the default. For grouped centering
mean = "overall.mean" can be specified, which will center data on the overall mean of the data. The logical argument
add.global.mean = TRUE used to toggle this in collapse 1.0.0 is therefore deprecated.
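The difference between centering on 0 and centering on the overall mean can be illustrated in base R with ave(). This is a sketch of the arithmetic only, on made-up data and without weights; it does not call fwithin.

```r
x <- c(1, 2, 3, 10, 20, 30)
g <- c("a", "a", "a", "b", "b", "b")

# Grouped centering on 0: subtract each group's mean (analogous to mean = 0).
centered_zero <- x - ave(x, g)

# Grouped centering on the overall mean: add the overall mean back
# (analogous to mean = "overall.mean").
centered_overall <- centered_zero + mean(x)
```

After the second step every group has the same mean as the full data, while the within-group variation is unchanged.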
mean = 0 (the default) and
sd = 1 (the default) to
fscale / STD. These arguments make it possible to (group) scale and center data to an arbitrary mean and standard deviation. Setting
mean = FALSE will just scale data while preserving the mean(s). Special options for grouped scaling are
mean = "overall.mean" (same as
fwithin / W), and
sd = "within.sd", which will scale the data such that the standard deviation of each group is equal to the within-standard deviation (= the standard deviation computed on the group-centered data). Thus group scaling a panel-dataset with
mean = "overall.mean" and
sd = "within.sd" harmonizes the data across all groups in terms of both mean and variance. The fast algorithm for variance calculation toggled with
stable.algo = FALSE was removed from
fscale. Welford’s numerically stable algorithm used by default is fast enough for all practical purposes. The fast algorithm is still available for
Added the function
finteraction, for fast interactions, and
as.character_factor to coerce a factor, or all factors in a list, to character (analogous to
as.numeric_factor). Also exported the function
ckmatch, for matching with error message showing non-matched elements.
First version of the package featuring only the functions
qsu based on code shared by Sebastian Krantz on R-devel, February 2019.
Major rework of the package using Rcpp and data.table internals, introduction of fast statistical functions and operators and expansion of the scope of the package to a broad set of data transformation and exploration tasks. Several iterations of enhancing speed of R code. Seamless integration of collapse with dplyr, plm and data.table. CRAN release of collapse 1.0.0 on 19th March 2020.