collapse for tidyverse Users
Sebastian Krantz
2024-11-25
Source:vignettes/collapse_for_tidyverse_users.Rmd
collapse is a C/C++ based package for data transformation and statistical computing in R that aims to enable greater performance and statistical complexity in data manipulation tasks and offers a stable, class-agnostic, and lightweight API. It is part of the core fastverse, a suite of lightweight packages with similar objectives.
The tidyverse set of packages provides a rich, expressive, and consistent syntax for data manipulation in R centering on the tibble object and tidy data principles (each observation is a row, each variable is a column).
collapse fully supports the tibble object and provides many tidyverse-like functions for data manipulation. It can thus be used to write tidyverse-like data manipulation code that, thanks to low-level vectorization of many statistical operations and optimized R code, typically runs much faster than native tidyverse code (in addition to being much more lightweight in dependencies).
Its aim is not to create a faster tidyverse, i.e., it does not implement all aspects of the rich tidyverse grammar or changes to it, and it also takes inspiration from other leading data manipulation libraries to serve broad aims of performance, parsimony, complexity, and robustness in data manipulation for R.
Namespace and Global Options
collapse data manipulation functions familiar to tidyverse users include fselect, fgroup_by, fsummarise, fmutate, across, frename, and fcount. Other functions like fsubset, ftransform, and get_vars are inspired by base R, while yet others like join, pivot, roworder, colorder, rowbind, etc. are inspired by other data manipulation libraries such as data.table and polars.
By virtue of the f- prefixes, the collapse namespace has no conflicts with the tidyverse, and these functions can easily be substituted in a tidyverse workflow.
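As a minimal sketch of such a substitution (assuming only that collapse is attached, without any masking), the prefixed verbs can be placed in a pipeline where their dplyr counterparts would otherwise go. The object name res is purely illustrative.

```r
library(collapse)

# Prefixed collapse verbs in an otherwise tidyverse-style pipeline
res <- mtcars |>
  fsubset(mpg > 11) |>
  fgroup_by(cyl, vs, am) |>
  fsummarise(across(c(mpg, carb, hp), fmean),
             qsec_wt = fmean(qsec, w = wt))   # weighted mean of qsec

dim(res)   # one row per (cyl, vs, am) group
```

Passing the weights by name (w = wt) keeps the call unambiguous whether or not it is used inside a grouped pipeline.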
R users willing to replace the tidyverse have the additional option to mask functions and eliminate the prefixes with set_collapse. For example,
library(collapse)
set_collapse(mask = "manip") # version >= 2.0.0
makes available functions select, group_by, summarise, mutate, rename, count, subset, and transform in the collapse namespace and detaches and re-attaches the package, such that the following code is executed by collapse:
mtcars |>
  subset(mpg > 11) |>
  group_by(cyl, vs, am) |>
  summarise(across(c(mpg, carb, hp), mean),
            qsec_wt = weighted.mean(qsec, wt))
# cyl vs am mpg carb hp qsec_wt
# 1 4 0 1 26.00000 2.000000 91.00000 16.70000
# 2 4 1 0 22.90000 1.666667 84.66667 21.04028
# 3 4 1 1 28.37143 1.428571 80.57143 18.75509
# 4 6 0 1 20.56667 4.666667 131.66667 16.33306
# 5 6 1 0 19.12500 2.500000 115.25000 19.21275
# 6 8 0 0 15.98000 2.900000 191.00000 17.01239
# 7 8 0 1 15.40000 6.000000 299.50000 14.55297
Note that the correct documentation still needs to be called with prefixes, i.e., ?fsubset. See ?set_collapse for further options to the package, which also include optimization options such as nthreads, na.rm, sort, and stable.algo.
Note also that if you use collapse’s namespace masking, you can use fastverse::fastverse_conflicts() to check for namespace conflicts with other packages.
Using the Fast Statistical Functions
A key feature of collapse is that it not only provides functions for data manipulation, but also a full set of statistical functions and algorithms to speed up statistical calculations and perform more complex statistical operations (e.g. involving weights or time series data).
Notable among these, the Fast Statistical Functions are a consistent set of S3-generic statistical functions providing fully vectorized statistical operations in R.
Specifically, operations such as calculating the mean via the S3 generic fmean() function are vectorized across columns and groups and may also involve weights or transformations of the original data:
fmean(mtcars$mpg) # Vector
# [1] 20.09062
fmean(EuStockMarkets) # Matrix
# DAX SMI CAC FTSE
# 2530.657 3376.224 2227.828 3565.643
fmean(mtcars) # Data Frame
# mpg cyl disp hp drat wt qsec vs am
# 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.437500 0.406250
# gear carb
# 3.687500 2.812500
fmean(mtcars$mpg, w = mtcars$wt) # Weighted mean
# [1] 18.54993
fmean(mtcars$mpg, g = mtcars$cyl) # Grouped mean
# 4 6 8
# 26.66364 19.74286 15.10000
fmean(mtcars$mpg, g = mtcars$cyl, w = mtcars$wt) # Weighted group mean
# 4 6 8
# 25.93504 19.64578 14.80643
fmean(mtcars[5:10], g = mtcars$cyl, w = mtcars$wt) # Of data frame
# drat wt qsec vs am gear
# 4 4.031264 2.414750 19.38044 0.9148868 0.6498031 4.047250
# 6 3.569170 3.152060 18.12198 0.6212191 0.3787809 3.821036
# 8 3.205658 4.133116 16.88529 0.0000000 0.1203808 3.240762
fmean(mtcars$mpg, g = mtcars$cyl, w = mtcars$wt, TRA = "fill") # Replace data by weighted group mean
# [1] 19.64578 19.64578 25.93504 19.64578 14.80643 19.64578 14.80643 25.93504 25.93504 19.64578
# [11] 19.64578 14.80643 14.80643 14.80643 14.80643 14.80643 14.80643 25.93504 25.93504 25.93504
# [21] 25.93504 14.80643 14.80643 14.80643 14.80643 25.93504 25.93504 25.93504 14.80643 19.64578
# [31] 14.80643 25.93504
# etc...
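As a sanity check on the grouped and weighted forms shown above, the following sketch compares fmean() with base R split-apply equivalents and illustrates one further TRA operation. The variable names are illustrative; the base R comparisons are just one way to reproduce the same numbers.

```r
library(collapse)

# Grouped mean: fmean() with g versus base tapply()
g_fast <- fmean(mtcars$mpg, g = mtcars$cyl)
g_base <- tapply(mtcars$mpg, mtcars$cyl, mean)
all.equal(as.numeric(g_fast), as.numeric(g_base))

# Weighted grouped mean versus split-apply with weighted.mean()
w_fast <- fmean(mtcars$mpg, g = mtcars$cyl, w = mtcars$wt)
w_base <- sapply(split(mtcars, mtcars$cyl),
                 function(d) weighted.mean(d$mpg, d$wt))
all.equal(as.numeric(w_fast), as.numeric(w_base))

# TRA = "-" subtracts each group's mean from its observations,
# so the centered values average to (numerically) zero within each group
centered <- fmean(mtcars$mpg, g = mtcars$cyl, TRA = "-")
max(abs(fmean(centered, g = mtcars$cyl)))
```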
The data manipulation functions of collapse are integrated with these Fast Statistical Functions to enable vectorized statistical operations. For example, the following code
mtcars |>
  subset(mpg > 11) |>
  group_by(cyl, vs, am) |>
  summarise(across(c(mpg, carb, hp), fmean),
            qsec_wt = fmean(qsec, wt))
# cyl vs am mpg carb hp qsec_wt
# 1 4 0 1 26.00000 2.000000 91.00000 16.70000
# 2 4 1 0 22.90000 1.666667 84.66667 21.04028
# 3 4 1 1 28.37143 1.428571 80.57143 18.75509
# 4 6 0 1 20.56667 4.666667 131.66667 16.33306
# 5 6 1 0 19.12500 2.500000 115.25000 19.21275
# 6 8 0 0 15.98000 2.900000 191.00000 17.01239
# 7 8 0 1 15.40000 6.000000 299.50000 14.55297
gives exactly the same result as above, but the execution is much faster (especially on larger data), because with Fast Statistical Functions, the data does not need to be split by groups, and there is no need to call lapply() inside the across() statement: fmean.data.frame() is simply applied to a subset of the data containing columns mpg, carb and hp.
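The idea can be illustrated by hand. The sketch below is not the package's internal code; it builds a grouping object once with GRP() and passes it to a single fmean() call on the three columns, then compares the values with the piped version. The names g, cols, manual and piped are illustrative.

```r
library(collapse)

g <- GRP(mtcars, ~ cyl + vs + am)                 # grouping object, computed once
cols <- get_vars(mtcars, c("mpg", "carb", "hp"))  # the columns to aggregate
manual <- fmean(cols, g = g)                      # one call covers all three columns

piped <- mtcars |>
  fgroup_by(cyl, vs, am) |>
  fsummarise(across(c(mpg, carb, hp), fmean))

# Same numbers either way
all.equal(unlist(manual, use.names = FALSE),
          unlist(piped[c("mpg", "carb", "hp")], use.names = FALSE))
```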
The Fast Statistical Functions also have a method for grouped data, so if we did not want to calculate the weighted mean of qsec, the code would simplify as follows:
mtcars |>
  subset(mpg > 11) |>
  group_by(cyl, vs, am) |>
  select(mpg, carb, hp) |>
  fmean()
# cyl vs am mpg carb hp
# 1 4 0 1 26.00000 2.000000 91.00000
# 2 4 1 0 22.90000 1.666667 84.66667
# 3 4 1 1 28.37143 1.428571 80.57143
# 4 6 0 1 20.56667 4.666667 131.66667
# 5 6 1 0 19.12500 2.500000 115.25000
# 6 8 0 0 15.98000 2.900000 191.00000
# 7 8 0 1 15.40000 6.000000 299.50000
Note that all functions in collapse, including the Fast Statistical Functions, have the default na.rm = TRUE, i.e., missing values are skipped in calculations. This can be changed using set_collapse(na.rm = FALSE) to give behavior more consistent with base R.
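A small sketch of the difference, assuming collapse version 2.0.0 or later (where get_collapse() is available). The previous setting is stored and restored so that the example leaves the session unchanged; the names x, old and strict are illustrative.

```r
library(collapse)

x <- c(1, NA, 3)
old <- get_collapse("na.rm")     # remember the current global setting

fmean(x)                         # under the default na.rm = TRUE the NA is skipped
fmean(x, na.rm = FALSE)          # per-call override: NA propagates, as in base mean()

set_collapse(na.rm = FALSE)      # change the global default
strict <- fmean(x)               # now NA without any per-call argument
set_collapse(na.rm = old)        # restore the previous setting
```

The per-call argument is the less intrusive choice when only a few computations should propagate missing values.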
Another thing to be aware of when using Fast Statistical Functions inside data manipulation functions is that they toggle vectorized execution wherever they are used. E.g.
mtcars |> group_by(cyl) |> summarise(mpg = fmean(mpg) + min(qsec)) # Vectorized
# cyl mpg
# 1 4 41.16364
# 2 6 34.24286
# 3 8 29.60000
calculates a grouped mean of mpg but adds the overall minimum of qsec to the result, whereas
mtcars |> group_by(cyl) |> summarise(mpg = fmean(mpg) + fmin(qsec)) # Vectorized
# cyl mpg
# 1 4 43.36364
# 2 6 35.24286
# 3 8 29.60000
mtcars |> group_by(cyl) |> summarise(mpg = mean(mpg) + min(qsec)) # Not vectorized
# cyl mpg
# 1 4 43.36364
# 2 6 35.24286
# 3 8 29.60000
both give the mean + the minimum within each group, but calculated in different ways: the former is equivalent to fmean(mpg, g = cyl) + fmin(qsec, g = cyl), whereas the latter is equal to mapply(function(x, y) mean(x) + min(y), gsplit(mpg, cyl), gsplit(qsec, cyl)).
See ?fsummarise and ?fmutate for more detailed examples. This eager vectorization approach is intentional as it allows users to vectorize complex expressions and fall back to base R if this is not desired. This blog post by Andrew Ghazi provides an excellent example of computing a p-value test statistic by groups.
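The three variants can be reproduced outside of a data manipulation verb. The sketch below uses base split() for the split-apply form and checks it against the fully vectorized form; the object names are illustrative.

```r
library(collapse)

# Fully vectorized: both terms are computed by group
vectorized <- with(mtcars, fmean(mpg, g = cyl) + fmin(qsec, g = cyl))

# Split-apply: one R function call per group
split_apply <- with(mtcars, mapply(function(m, q) mean(m) + min(q),
                                   split(mpg, cyl), split(qsec, cyl)))

# Mixed: grouped mean plus the overall minimum of qsec
mixed <- with(mtcars, fmean(mpg, g = cyl) + min(qsec))

all.equal(as.numeric(vectorized), as.numeric(split_apply))
rbind(vectorized, mixed)   # the two differ wherever a group's minimum exceeds the overall one
```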
To take full advantage of collapse, it is highly recommended to use the Fast Statistical Functions as much as possible. You can also call set_collapse(mask = "all") to replace statistical functions in base R like sum and mean with the collapse versions (toggling vectorized execution in all cases), but this may affect other parts of your code.
Writing Efficient Code
It is also performance-critical to correctly sequence operations and limit excess computations. tidyverse code is often inefficient simply because the tidyverse allows you to do everything. For example,
mtcars |> group_by(cyl) |> filter(mpg > 13) |> arrange(mpg)
is permissible but inefficient code as it filters and reorders grouped data, requiring modifications to both the data frame and the attached grouping object. collapse does not allow calls to fsubset() on grouped data, and issues a message about it in roworder(), encouraging you to write more efficient code.
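One possible reordering of such a pipeline, sketched with the prefixed collapse verbs: filter and sort the plain data frame first, and group only afterwards so that the grouping object is built once on the final rows. The summary columns first_mpg and n are illustrative.

```r
library(collapse)

ordered <- mtcars |>
  fsubset(mpg > 13) |>   # filter the ungrouped data frame
  roworder(mpg)          # then sort it

res <- ordered |>
  fgroup_by(cyl) |>      # group last
  fsummarise(first_mpg = ffirst(mpg),   # smallest mpg per group, given the sort
             n = fnobs(mpg))            # observations per group
res
```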
The earlier aggregation pipeline can also be optimized, because it subsets the whole frame and then does computations on only a subset of columns. It would be more efficient to select all required columns during the subset operation:
mtcars |>
  subset(mpg > 11, cyl, vs, am, mpg, carb, hp, qsec, wt) |>
  group_by(cyl, vs, am) |>
  summarise(across(c(mpg, carb, hp), fmean),
            qsec_wt = fmean(qsec, wt))
# cyl vs am mpg carb hp qsec_wt
# 1 4 0 1 26.00000 2.000000 91.00000 16.70000
# 2 4 1 0 22.90000 1.666667 84.66667 21.04028
# 3 4 1 1 28.37143 1.428571 80.57143 18.75509
# 4 6 0 1 20.56667 4.666667 131.66667 16.33306
# 5 6 1 0 19.12500 2.500000 115.25000 19.21275
# 6 8 0 0 15.98000 2.900000 191.00000 17.01239
# 7 8 0 1 15.40000 6.000000 299.50000 14.55297
Without the weighted mean of qsec, this would simplify to
mtcars |>
  subset(mpg > 11, cyl, vs, am, mpg, carb, hp) |>
  group_by(cyl, vs, am) |>
  fmean()
# cyl vs am mpg carb hp
# 1 4 0 1 26.00000 2.000000 91.00000
# 2 4 1 0 22.90000 1.666667 84.66667
# 3 4 1 1 28.37143 1.428571 80.57143
# 4 6 0 1 20.56667 4.666667 131.66667
# 5 6 1 0 19.12500 2.500000 115.25000
# 6 8 0 0 15.98000 2.900000 191.00000
# 7 8 0 1 15.40000 6.000000 299.50000
Finally, we could set the following options to toggle unsorted grouping, no missing value skipping, and multithreading across the three columns for more efficient execution.
mtcars |>
  subset(mpg > 11, cyl, vs, am, mpg, carb, hp) |>
  group_by(cyl, vs, am, sort = FALSE) |>
  fmean(nthreads = 3, na.rm = FALSE)
# cyl vs am mpg carb hp
# 1 6 0 1 20.56667 4.666667 131.66667
# 2 4 1 1 28.37143 1.428571 80.57143
# 3 6 1 0 19.12500 2.500000 115.25000
# 4 8 0 0 15.98000 2.900000 191.00000
# 5 4 1 0 22.90000 1.666667 84.66667
# 6 4 0 1 26.00000 2.000000 91.00000
# 7 8 0 1 15.40000 6.000000 299.50000
Setting these options globally using set_collapse(sort = FALSE, nthreads = 3, na.rm = FALSE) avoids the need to set them repeatedly.
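To see that unsorted grouping changes only the order of the output rows and not the values, the following sketch computes both versions and compares them after reordering with base R. The object names are illustrative.

```r
library(collapse)

sorted   <- mtcars |> fgroup_by(cyl, vs, am) |> fmean()
unsorted <- mtcars |> fgroup_by(cyl, vs, am, sort = FALSE) |> fmean()

# Groups appear in first-appearance order when sort = FALSE
unsorted[1, c("cyl", "vs", "am")]

# Reorder the unsorted result by the grouping columns and compare the values
o <- order(unsorted$cyl, unsorted$vs, unsorted$am)
all.equal(data.matrix(unsorted)[o, ], data.matrix(sorted),
          check.attributes = FALSE)
```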
Using Internal Grouping
Another key to writing efficient code with collapse is to avoid fgroup_by() where possible, especially for mutate operations. collapse does not implement .by arguments in its manipulation functions as dplyr does, but instead allows ad-hoc grouped transformations through its statistical functions. For example, the easiest and fastest way to compute the median of mpg by cyl, vs, and am is
mtcars |>
  mutate(mpg_median = fmedian(mpg, list(cyl, vs, am), TRA = "fill")) |>
  head(3)
# mpg cyl disp hp drat wt qsec vs am gear carb mpg_median
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 21.0
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 21.0
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 30.4
For the common case of averaging and centering data, collapse also provides functions fbetween() for averaging and fwithin() for centering, i.e., fbetween(mpg, list(cyl, vs, am)) is the same as fmean(mpg, list(cyl, vs, am), TRA = "fill"). There is also fscale() for (grouped) scaling and centering.
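These equivalences can be checked directly. The sketch below groups by cyl alone to keep the check simple (every group then has several observations) and rebuilds the scaled values from the grouped mean and standard deviation; the object names are illustrative.

```r
library(collapse)

x <- mtcars$mpg
g <- mtcars$cyl

# Averaging: fbetween() versus fmean() with TRA = "fill"
all.equal(fbetween(x, g), fmean(x, g, TRA = "fill"))

# Centering: fwithin() versus fmean() with TRA = "-"
all.equal(fwithin(x, g), fmean(x, g, TRA = "-"))

# Scaling: fscale() versus (x - group mean) / group standard deviation
manual_scale <- fwithin(x, g) / fsd(x, g, TRA = "fill")
all.equal(fscale(x, g), manual_scale)
```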
This also applies to multiple columns, where we can use fmutate(across(...)) or ftransformv(), i.e.
mtcars |>
  mutate(across(c(mpg, disp, qsec), fmedian, list(cyl, vs, am), TRA = "fill")) |>
  head(2)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21 6 160 110 3.9 2.875 16.46 0 1 4 4
# Or
mtcars |>
  transformv(c(mpg, disp, qsec), fmedian, list(cyl, vs, am), TRA = "fill") |>
  head(2)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21 6 160 110 3.9 2.875 16.46 0 1 4 4
Of course, if we want to apply different functions using the same grouping, fgroup_by() is sensible, but for mutate operations it also has the argument return.groups = FALSE, which avoids materializing the unique grouping columns, saving some memory.
mtcars |>
  group_by(cyl, vs, am, return.groups = FALSE) |>
  mutate(mpg_median = fmedian(mpg),
         mpg_mean = fmean(mpg), # Or fbetween(mpg)
         mpg_demean = fwithin(mpg), # Or fmean(mpg, TRA = "-")
         mpg_scale = fscale(mpg),
         .keep = "used") |>
  ungroup() |>
  head(3)
# mpg cyl vs am mpg_median mpg_mean mpg_demean mpg_scale
# Mazda RX4 21.0 6 0 1 21.0 20.56667 0.4333333 0.5773503
# Mazda RX4 Wag 21.0 6 0 1 21.0 20.56667 0.4333333 0.5773503
# Datsun 710 22.8 4 1 1 30.4 28.37143 -5.5714286 -1.1710339
The TRA argument supports a whole array of operations, see ?TRA. For example fsum(mtcars, TRA = "/") turns the column vectors into proportions. As an application of this, consider a generated dataset of sector-level exports.
# c = country, s = sector, y = year, v = value
exports <- expand.grid(c = paste0("c", 1:8), s = paste0("s", 1:8), y = 1:15) |>
  mutate(v = round(abs(rnorm(length(c), mean = 5)), 2)) |>
  subset(-sample.int(length(v), 360)) # Making it unbalanced and irregular
head(exports)
# c s y v
# 1 c2 s1 1 5.55
# 2 c3 s1 1 4.33
# 3 c4 s1 1 5.21
# 4 c5 s1 1 5.31
# 5 c6 s1 1 6.17
# 6 c7 s1 1 5.62
nrow(exports)
# [1] 600
It is then very easy to compute Balassa’s (1965) Revealed Comparative Advantage (RCA) index, which is the share of a sector in country exports divided by the share of the sector in world exports. An index above 1 indicates a revealed comparative advantage of country c in sector s.
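To make the definition concrete, here is a minimal sketch that spells it out term by term on a hypothetical two-country, two-sector example for a single year (the vectors cty, sec and v are made up for illustration). The numerator divides each value by its country total; the denominator is the sector's total divided by the world total, which is a different quantity from each value divided by its sector total.

```r
library(collapse)

# Hypothetical toy data: one year, two countries, two sectors
cty <- c("c1", "c1", "c2", "c2")
sec <- c("s1", "s2", "s1", "s2")
v   <- c(6, 2, 2, 6)

# Share of the sector in the country's exports: v / country total
share_in_country <- fsum(v, cty, TRA = "/")

# Share of the sector in world exports: sector total / world total
share_in_world <- fsum(v, sec, TRA = "fill") / fsum(v)

rca <- share_in_country / share_in_world
data.frame(cty, sec, v, rca)
```

In this toy example c1 exports mostly s1 and c2 mostly s2, so the index is above 1 for exactly those two country-sector pairs.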
# Computing Balassa's (1965) RCA index: fast and memory efficient
# settfm() modifies exports and assigns it back to the global environment
settfm(exports, RCA = fsum(v, list(c, y), TRA = "/") %/=% fsum(v, list(s, y), TRA = "/"))
Note that this involved a single expression with two different grouped operations, which is only possible by incorporating grouping into statistical functions themselves. Let’s summarise this dataset using pivot() to aggregate the RCA index across years. Here "mean" calls a highly efficient internal mean function.
pivot(exports, ids = "c", values = "RCA", names = "s",
      how = "wider", FUN = "mean", sort = TRUE)
# c s1 s2 s3 s4 s5 s6 s7 s8
# 1 c1 0.9327521 0.9087815 0.9434970 1.105864 1.158613 0.9579166 1.1094150 1.218718
# 2 c2 1.4989832 1.0502050 0.8113781 1.024990 1.103707 1.1494829 1.0681358 1.021685
# 3 c3 1.0403483 0.9580809 0.8358023 1.024633 1.192487 0.9333733 1.0719161 1.010648
# 4 c4 0.9771630 1.0265800 0.9293951 1.007469 1.052942 0.9285248 1.4031524 1.027218
# 5 c5 0.9807908 1.1023470 0.8480027 1.080013 1.072168 0.9704144 1.1817784 1.099050
# 6 c6 0.9819940 1.1434701 0.9122508 1.164649 1.193275 0.9322847 0.9929571 1.177062
# 7 c7 1.1542193 1.1939893 0.7462051 1.109936 1.438044 1.0482547 1.5907867 1.055214
# 8 c8 1.4220817 1.2235288 0.7090515 1.189408 1.119605 1.3108897 1.3264848 1.279526
We may also wish to investigate the growth rate of RCA. This can be done using fgrowth(). Since the panel is irregular, i.e., not every sector is observed in every year, it is critical to also supply the time variable.
exports |>
  mutate(RCA_growth = fgrowth(RCA, g = list(c, s), t = y)) |>
  pivot(ids = "c", values = "RCA_growth", names = "s",
        how = "wider", FUN = fmedian, sort = TRUE)
# c s1 s2 s3 s4 s5 s6 s7 s8
# 1 c1 NA 29.87093 56.837880 0.3513705 11.9750588 6.356499 5.186966 3.4725766
# 2 c2 -19.092254 -10.72516 50.412427 8.7380006 -25.7119274 -17.958011 -36.853824 -30.5827161
# 3 c3 -3.904880 -29.72276 4.338254 4.2112875 13.8705938 -27.368230 -5.214542 -10.4867005
# 4 c4 0.639523 19.74757 -9.602120 9.7104112 42.0912878 17.583594 -27.915967 -18.1145784
# 5 c5 8.184523 18.93554 -5.333235 1.5243547 -0.3306585 8.682935 -15.678443 18.3991608
# 6 c6 12.606978 67.07558 19.270685 43.8243108 -25.0283737 -21.785028 -10.059702 0.7774246
# 7 c7 24.400344 48.56792 27.552571 -16.9311897 -6.6046775 -28.627885 -12.092345 24.5298895
# 8 c8 158.342022 17.99249 -61.857965 36.3372079 0.2085139 -2.178978 -18.666774 -40.5714063
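Why the time variable matters for the growth rates above can be seen on a tiny series with a gap. This is a sketch assuming a recent collapse version with support for irregular time series; x and yr are made-up values.

```r
library(collapse)

x  <- c(1, 2, 4)
yr <- c(1L, 2L, 4L)          # year 3 is not observed

no_t   <- fgrowth(x)         # treats the rows as consecutive periods
with_t <- fgrowth(x, t = yr) # uses yr to locate the previous period

cbind(yr, x, no_t, with_t)
```

Without the time variable the value in year 4 is compared with year 2, as if no year were missing; with it, the growth rate for year 4 is missing because year 3 is unobserved.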
Lastly, since the panel is unbalanced, we may wish to create an RCA index for only the last year, but balance the dataset a bit more by taking the last available trade within the last three years. This can be done using a single subset call
# Taking the latest observation within the last 3 years
exports_latest <- subset(exports, y > 12 & y == fmax(y, list(c, s), "fill"), -y)
# How many sectors do we observe for each country in the last 3 years?
with(exports_latest, fndistinct(s, c))
# c1 c2 c3 c4 c5 c6 c7 c8
# 8 8 7 7 8 8 6 8
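The filter used above relies on fmax() with TRA = "fill" broadcasting each group's maximum back to every row of that group. A small sketch on made-up data (the data frame d and its columns are hypothetical):

```r
library(collapse)

d <- data.frame(id = c("a", "a", "b", "b"),
                yr = c(1L, 3L, 1L, 2L),
                v  = c(10, 11, 20, 21))

# Each id's latest year, repeated on every row of that id
with(d, fmax(yr, id, "fill"))

# Keep only the rows where yr equals that latest year
latest <- fsubset(d, yr == fmax(yr, id, "fill"))
latest
```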
We can then compute the RCA index on this data
exports_latest |>
  mutate(RCA = fsum(v, c, TRA = "/") %/=% fsum(v, s, TRA = "/")) |>
  pivot("c", "RCA", "s", how = "wider", sort = TRUE)
# c s1 s2 s3 s4 s5 s6 s7 s8
# 1 c1 0.9038055 0.9073996 0.7608879 0.5752643 0.8558140 0.6619450 0.8820296 0.9617336
# 2 c2 1.1725178 1.1771805 0.9871092 0.7462973 1.1102578 0.8587493 1.1442677 1.2476687
# 3 c3 1.2072861 1.2120870 1.0163796 NA 1.1431799 0.8842135 1.1781982 1.2846653
# 4 c4 1.2438173 1.2487635 1.0471341 0.7916788 1.1777713 NA 1.2138493 1.3235380
# 5 c5 1.0014055 1.0053877 0.8430546 0.6373858 0.9482314 0.7334270 0.9772781 1.0655891
# 6 c6 1.0234618 1.0275317 0.8616232 0.6514245 0.9691166 0.7495810 0.9988030 1.0890591
# 7 c7 1.3447625 1.3501101 NA NA 1.2733564 0.9849009 1.3123624 1.4309531
# 8 c8 1.1226366 1.1271008 0.9451155 0.7145483 1.0630252 0.8222164 1.0955882 1.1945903
To summarise, collapse provides many options for ad-hoc or limited grouping, which are faster than a full fgroup_by(), and also syntactically efficient. Further efficiency gains are possible using operations by reference, e.g., %/=% instead of / to avoid an intermediate copy. It is also possible to transform by reference using fast statistical functions by passing the set = TRUE argument, e.g., with(mtcars, fmean(mpg, cyl, TRA = "fill", set = TRUE)) replaces mpg by its group-averaged version (the transformed vector is returned invisibly).
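A minimal sketch of both by-reference operations on a freshly created vector (x and g are illustrative). Because the vector is modified in place, any other object sharing the same memory would change as well, so this is best used on data that is not referenced elsewhere.

```r
library(collapse)

x <- c(2, 4, 6, 8)
g <- c("a", "a", "b", "b")

x %/=% 2                                # divides x in place: 1 2 3 4
fmean(x, g, TRA = "fill", set = TRUE)   # replaces x in place by its group means
x                                       # 1.5 1.5 3.5 3.5
```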
Conclusion
collapse enhances R both statistically and computationally and is a good option for tidyverse users searching for more efficient and lightweight solutions to data manipulation and statistical computing problems in R. For more information, I recommend starting with the short vignette on Documentation Resources.
R users willing to write efficient/lightweight code and completely replace the tidyverse in their workflow are also encouraged to closely examine the fastverse suite of packages. collapse alone may not always suffice, but 99% of tidyverse code can be replaced with an efficient and lightweight fastverse solution.