# collapse's Handling of R Objects

### A Quick View Behind the Scenes of Class-Agnostic R Programming

#### Sebastian Krantz

#### 2024-11-03

Source: `vignettes/collapse_object_handling.Rmd`

This much-requested vignette provides some details about how
*collapse* deals with various R objects. It is principally a
digest of cumulative details provided in the NEWS for
various releases since v1.4.0.

## Overview

*collapse* is a class-agnostic programming framework that can
deal with a broad range of R objects. It provides explicit support for
base R classes and data types (*integer*, *double*,
*character*, *logical*, *list*,
*data.frame*, *matrix*, *factor*, *Date*,
*POSIXct*, *ts*), as well as *data.table*,
*tibble*, *grouped_df*, *xts*, *zoo*,
*pseries*, *pdata.frame*, *units*, and *sf*
(no geometric operations).

It also introduces *GRP_df* as a more performant and class-agnostic
grouped data frame, and *indexed_series* and *indexed_frame* classes as
modern class-agnostic successors of *pseries* and *pdata.frame*. These
objects inherit the classes they succeed and are handled through
`.pseries`, `.pdata.frame`, and `.grouped_df` methods, which also
support the original (*plm* / *dplyr*) implementations (more details at
the end).

All other objects are handled internally at the C or R level using
general principles extended by specific considerations for some of the
above classes. I start with summarizing the general principles, which
enable the usage of *collapse* with further classes it was not
designed for.

## General Principles

In general, in *collapse*, attributes and classes of R objects
are preserved in statistical and data manipulation operations unless
their preservation involves a **high risk** of yielding
something wrong/useless. Risky operations are those that change the
dimensions or internal data type (`typeof()`) of an R object.

To *collapse*’s R and C code, there exist 3 principal types of
objects: atomic vectors, matrices, and lists, which are often assumed
to be data frames. Most data manipulation functions in *collapse*
like `fmutate()` only support lists, whereas statistical
functions, notably the S3 generic *Fast Statistical Functions* like
`fmean()`, generally support all 3 types of objects.

S3 generic functions initially dispatch to `.default`, `.matrix`,
`.data.frame`, and (hidden) `.list` methods. The `.list` method
generally dispatches to the `.data.frame` method. These basic methods,
and other non-generic functions in *collapse*, then decide how exactly to
handle the object based on the statistical operation performed and
attribute handling principles mostly implemented in C.

The simplest case arises when an operation preserves the dimensions
of the object, such as `fscale(x)` or
`fmutate(data, across(a:c, log))`. In this case, all
attributes of `x` / `data` are preserved^{1}.
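
As a minimal sketch of this dimension-preserving case, the snippet below standardizes a vector that carries a custom attribute. The vector and its `"label"` value are made up for illustration.

```r
library(collapse)

# A numeric vector carrying a custom "label" attribute (illustrative data)
x <- structure(c(1, 2, 3, 4), label = "some variable")

# fscale() standardizes x but keeps its length, so attributes are carried over
z <- fscale(x)
attr(z, "label")
```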

Another simple case for matrices and lists arises when a statistical
operation reduces them to a single dimension such as
`fmean(x)`, where, under the `drop = TRUE` default
of *Fast Statistical Functions*, all attributes apart from (column-)names
are dropped and a (named) vector of means is returned.

For atomic vectors, a statistical operation like `fmean(x)` will
preserve the attributes (except for *ts* objects), as the object could
have useful properties such as labels or units.
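
The two ungrouped cases can be contrasted in a short sketch. The matrix is built from the first three columns of `mtcars`, and the `"label"` attribute on the vector is an assumption added for illustration.

```r
library(collapse)

# Plain numeric matrix with column names
m <- qM(mtcars[1:3])

# Labelled atomic vector (the label is illustrative, not part of mtcars)
x <- structure(mtcars$mpg, label = "Miles per gallon")

mu_m <- fmean(m)  # matrix: reduced to a named vector, 'dim' is dropped
mu_x <- fmean(x)  # vector: reduced to a scalar that keeps the label

names(mu_m)
attributes(mu_x)
```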

More complex cases involve changing the dimensions (number of rows or
columns) of an object. If the number of rows is preserved,
e.g. `fmutate(data, a_b = a / b)` or `flag(x, -1:1)`, only the
(column-)names attribute of the object is modified. If the number of
rows is reduced, e.g. `fmean(x, g)`, all attributes are also retained
under suitable modifications of the (row-)names attribute. However, if
`x` is a matrix, other attributes than row- or column-names
are only retained if `!is.object(x)`, that is, if the matrix
does not have a ‘class’ attribute. For atomic vectors, attributes are
retained if `!inherits(x, "ts")`, as aggregating a time
series will break the class. This also applies to columns in a data
frame being aggregated.
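
A small sketch of grouped aggregation on atomic vectors follows. The labelled vector and the grouping are made up, and the yearly grouping of `AirPassengers` via `rep(1:12, each = 12)` is simply a convenient way to aggregate a *ts* object.

```r
library(collapse)

# Labelled vector aggregated by a grouping vector (illustrative data)
x <- structure(c(1, 2, 3, 4), label = "some variable")
g <- c("a", "a", "b", "b")
agg <- fmean(x, g)
attributes(agg)  # group names plus the retained label

# Aggregating a 'ts' object: time series attributes are not carried over
ts_agg <- fmean(AirPassengers, rep(1:12, each = 12))
is.ts(ts_agg)
```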

When data is transformed using statistics as provided by the `TRA()`
function, e.g. `TRA(x, STATS, operation, groups)`, and the
like-named argument to the *Fast Statistical Functions*, operations
that simply modify the input (`x`) in a statistical sense
(`"replace_NA"`, `"-"`, `"-+"`, `"/"`, `"+"`, `"*"`, `"%%"`, `"-%%"`)
just copy the attributes to the transformed object. Operations
`"replace_fill"` and `"replace"` are more tricky,
since here `x` is replaced with `STATS`, which
could be of a different class or data type. The following rules apply:
(1) the result has the same data type as `STATS`; (2) if
`is.object(STATS)`, the attributes of `STATS` are
preserved; (3) otherwise the attributes of `x` are preserved
unless `is.object(x) && typeof(x) != typeof(STATS)`;
(4) an exemption to this rule is made if `x` is a factor and
an integer replacement is supplied as `STATS`,
e.g. `fnobs(factor, group, TRA = "replace_fill")`. In that
case, the attributes of `x` are copied except for the ‘class’
and ‘levels’ attributes. These rules were devised considering the
possibility that `x` may have important information attached
to it which should be preserved in data transformations, such as a
`"label"` attribute.
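
The sketch below illustrates two of these rules with made-up data: a centering operation (`"-"`) that simply copies the attributes of `x`, and the factor exemption under `"replace_fill"`, where the integer counts should not come back as a factor.

```r
library(collapse)

g <- c(1, 1, 2, 2)

# Operation "-": the attributes of x are copied to the result
x <- structure(c(1, 2, 3, 4), label = "some variable")
centered <- fmean(x, g, TRA = "-")
attr(centered, "label")

# Factor exemption: group counts replace the factor values,
# and the 'class' and 'levels' attributes are not copied
f <- factor(c("a", "b", NA, "a"))
n_obs <- fnobs(f, g, TRA = "replace_fill")
typeof(n_obs)
is.factor(n_obs)
```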

Another rather complex case arises when manipulating data with
*collapse* using base R functions,
e.g. `BY(mtcars$mpg, mtcars$cyl, mad)` or
`mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mad_mpg = mad(mpg))`.
In this case, *collapse* internally uses base R functions
`lapply()` and `unlist()`, following efficient
splitting with `gsplit()` (which preserves all attributes).
Concretely, the result is computed as
`y = unlist(lapply(gsplit(x, g), FUN, ...), FALSE, FALSE)`,
where in the examples `x` is `mtcars$mpg`,
`g` is the grouping variable(s), `FUN = mad`, and
`y` is `mad(x)` in each group. To follow its
policy of attribute preservation as closely as possible,
*collapse* then calls an internal function
`y_final = copyMostAttributes(y, x)`, which copies the
attributes of `x` to `y` if both are deemed
compatible^{2} ($\approx$ of the same data type). If they are deemed
incompatible, `copyMostAttributes` still checks if `x` has a
`"label"` attribute and copies that one to `y`.
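
The split-apply-combine step can be reproduced by hand. This is only a sketch of the computation quoted above and leaves out the internal attribute-copying step. The names `y_manual` and `y_by` are chosen here for illustration.

```r
library(collapse)

x <- mtcars$mpg
g <- mtcars$cyl

# Manual version of the computation: split, apply, and flatten
y_manual <- unlist(lapply(gsplit(x, g), mad), FALSE, FALSE)

# The same statistic through BY(), which also attaches group names
y_by <- BY(x, g, mad)

all.equal(unattrib(y_by), y_manual)
```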

So to summarize the general principles: *collapse* just tries
to preserve attributes in all cases except where doing so is likely to
break something, given the way most commonly used R classes and
objects behave. The operations most likely to break something are
aggregating matrices which have a class (such as
*mts*/*xts*) or univariate time series (*ts*), replacing
data with another object, or applying an unknown
function to a vector by groups and assembling the result with
`unlist()`. In the latter cases, particular attention is paid
to integer vectors and factors, as we often count something generating
integers, and malformed factors need to be avoided.

The following section provides some further details for some
*collapse* functions and supported classes.

## Specific Functions and Classes

#### Object Conversions

Quick conversion functions `qDF()`, `qDT()`, `qTBL()` and `qM()` (to
create *data.frame*’s, *data.table*’s, *tibble*’s and matrices from
arbitrary R objects) by default (`keep.attr = FALSE`) perform very strict
conversions, where all attributes non-essential to the class are dropped
from the input object. This is to ensure that, following conversion,
objects behave exactly the way users expect. This is different from the
behavior of functions like `as.data.frame()`, `as.data.table()`,
`as_tibble()` or `as.matrix()`, e.g. `as.matrix(EuStockMarkets)`
just returns `EuStockMarkets` whereas `qM(EuStockMarkets)` returns a
plain matrix without time series attributes. This behavior can be
changed by setting `keep.attr = TRUE`,
i.e. `qM(EuStockMarkets, keep.attr = TRUE)`.
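
The `EuStockMarkets` example from the paragraph above can be checked directly. The object names `m_base`, `m_plain` and `m_keep` are chosen here for illustration.

```r
library(collapse)

m_base  <- as.matrix(EuStockMarkets)                # base R: returns the input as is
m_plain <- qM(EuStockMarkets)                       # strict: plain matrix
m_keep  <- qM(EuStockMarkets, keep.attr = TRUE)     # keeps time series attributes

is.ts(m_base)
is.ts(m_plain)
is.ts(m_keep)
```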

#### Selecting Columns by Data Type

Functions `num_vars()`, `cat_vars()` (the opposite of `num_vars()`),
`char_vars()` etc. are implemented in C to quickly select
columns by data type without the need to check data frame columns by
applying a function such as `is.numeric()` in R. For
`is.numeric`, the C implementation is equivalent to
`is_numeric_C <- function(x) typeof(x) %in% c("integer", "double") && (!is.object(x) || inherits(x, "ts") || inherits(x, "units") || inherits(x, "integer64"))`.
This of course does not respect the behavior of other classes that
define methods for `is.numeric`. For example, given
`is.numeric.foo <- function(x) TRUE` and
`y = structure(rnorm(100), class = "foo")`,
`is.numeric(y)` is `TRUE` but `num_vars(data.frame(y))` returns an
empty frame. Correct behavior in this case requires
`get_vars(data.frame(y), is.numeric)`. A particular case to
be aware of here is when using `collap()` with the
`FUN` and `catFUN` arguments, where the C code
(`is_numeric_C`) is used internally to decide whether a
column is numeric or categorical. Thus numeric columns with a class
attribute other than “ts”, “units”, and “integer64” are not recognized
as numeric by `collap()`. *collapse* also does not
support statistical operations on complex data.
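
A minimal sketch of the difference between the C-level check and an R-level predicate follows. The class `"foo"` is hypothetical, and the frame is assembled with `qDF(list(...))` here purely for convenience.

```r
library(collapse)

# Data frame with an integer column, a classed double column, and a character column
d <- qDF(list(
  a = 1:3,
  y = structure(c(1.5, 2.5, 3.5), class = "foo"),  # hypothetical class "foo"
  s = c("u", "v", "w")
))

names(num_vars(d))              # C-level check: the classed column is skipped
names(get_vars(d, is.numeric))  # R-level predicate applied to each column
```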

#### Parsing of Time-IDs

*Time Series Functions* `flag`, `fdiff`, `fgrowth` and
`psacf/pspacf/psccf` (and the operators `L/F/D/Dlog/G`) have a `t`
argument to pass time-ids for fully identified temporal operations on
time series and panel data. If `t` is a plain numeric vector or a factor,
it is coerced to integer using `as.integer()`, and the
integer steps are used as time steps. This is premised on the
observation that the most common form of temporal identifier is a
numeric variable denoting calendar years. If on the other hand
`t` is a numeric time object such that
`is.object(t) && is.numeric(unclass(t))` (e.g. Date,
POSIXct, etc.), then it is passed through `timeid()`, which
computes the greatest common divisor of the vector and generates an
integer time-id in that way. Users are therefore advised to use
appropriate classes to represent time steps, e.g. for monthly data
`zoo::yearmon` would be appropriate. It is also possible to
pass non-numeric `t`, such as character or list/data.frame.
In such cases ordered grouping is applied to generate an integer
time-id, but this should rather be avoided.
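
The effect of passing a plain numeric time-id can be sketched with a short series that has a gap in its calendar years (the data are made up):

```r
library(collapse)

x    <- c(10, 20, 30)
year <- c(2001, 2002, 2004)   # 2003 is missing

lag_pos  <- flag(x)            # positional lag: ignores the gap
lag_time <- flag(x, t = year)  # time-aware lag: integer years are the time steps

lag_pos
lag_time
```

With the time-id supplied, the lag for 2004 refers to the unobserved year 2003 and is therefore missing rather than taken from 2002.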

#### *xts*/*zoo* Time Series

*xts*/*zoo* time series are handled through
`.zoo` methods to all relevant functions. These methods are
simple and all follow this pattern:
`FUN.zoo <- function(x, ...) if(is.matrix(x)) FUN.matrix(x, ...) else FUN.default(x, ...)`.
Thus the general principles apply. Time series functions do not
automatically use the index for indexed computations, partly for
consistency with native methods where this is also not the case
(e.g. `lag.xts` does not perform an indexed lag), and partly
because, as outlined above, the index does not necessarily accurately
reflect the time structure. Thus the user must exercise discretion to
perform an indexed lag on *xts*/*zoo*. For example:
`flag(xts_daily, 1:3, t = index(xts_daily))` or
`flag(xts_monthly, 1:3, t = zoo::as.yearmon(index(xts_monthly)))`.

#### Support for *sf* and *units*

*collapse* internally supports *sf* by seeking to avoid
undue destruction of *sf* objects through removal of the
‘geometry’ column in data manipulation operations. This is simply
implemented through an additional check in the C programs used to subset
columns of data: if the object is an *sf* data frame, the
‘geometry’ column is added to the column selection. Other functions like
`funique()` or `roworder()` have internal
facilities to avoid sorting or grouping on the ‘geometry’ column. Yet
other functions like `descr()` and `qsu()` simply
omit the geometry column in their statistical calculations. A short vignette
describes the integration of *collapse* and *sf* in a bit
more detail. In summary: *collapse* supports *sf* by
seeking to appropriately deal with the ‘geometry’ column. It cannot
perform any geometrical operations. For example, after subsetting
*sf* with `fsubset()`, the bounding box attribute of
the geometry is unaltered and likely too large.

Regarding *units* objects, all relevant functions also have
simple methods of the form
`FUN.units <- function(x, ...) if(is.matrix(x)) copyMostAttrib(FUN.matrix(x, ...), x) else FUN.default(x, ...)`.
According to the general principles, the default method preserves the
units class, whereas the matrix method does not if `FUN`
aggregates the data. The use of `copyMostAttrib()`, which
copies all attributes apart from `"dim"`, `"dimnames"`, and `"names"`,
ensures that the returned objects are still *units* objects.

#### Support for *data.table*

*collapse* provides quite thorough support for
*data.table*. The simplest level of support is that it avoids
assigning descriptive (character) row names to *data.table*’s,
e.g. `fmean(mtcars, mtcars$cyl)` has row-names corresponding
to the groups but `fmean(qDT(mtcars), mtcars$cyl)` does not.
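
The row-name behavior can be inspected without loading *data.table*, since `qDT()` only sets the class and attributes. The names `res_df` and `res_dt` are chosen here for illustration.

```r
library(collapse)

res_df <- fmean(mtcars, mtcars$cyl)       # data.frame: group labels become row names
res_dt <- fmean(qDT(mtcars), mtcars$cyl)  # data.table: no descriptive row names

rownames(res_df)
attr(res_dt, "row.names")
```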

*collapse* further supports *data.table*’s reference
semantics (`set*`, `:=`). To be able to add
columns by reference (e.g. `DT[, new := 1]`),
*data.table*’s are implemented as overallocated lists^{3}.
*collapse* copied some C code from *data.table* to do the
overallocation and generate the `".internal.selfref"`
attribute, so that `qDT()` creates a valid and fully
functional *data.table*. To enable seamless data manipulation
combining *collapse* and *data.table*, all data
manipulation functions in *collapse* call this C code at the end
and return a valid (overallocated) *data.table*. However, because
this overallocation comes at a computational cost of 2-3 microseconds, I
have opted against also adding it to the `.data.frame`
methods of statistical functions. Concretely, this means that
`res <- DT |> fgroup_by(id) |> fsummarise(mu_a = fmean(a))`
gives a fully functional *data.table*,
i.e. `res[, new := 1]` works, but
`res2 <- DT |> fgroup_by(id) |> fmean()` gives a
non-overallocated *data.table* such that
`res2[, new := 1]` will still work but issue a warning. In
this case,
`res2 <- DT |> fgroup_by(id) |> fmean() |> qDT()`
can be used to avoid the warning. This, to me, seems a reasonable
trade-off between flexibility and performance. More details and examples
are provided in the *collapse* and *data.table* vignette.

#### Class-Agnostic Grouped and Indexed Data Frames

As indicated in the introductory remarks, *collapse* provides
a fast class-agnostic grouped data frame created with `fgroup_by()`,
and fast class-agnostic indexed time series and panel data, created with
`findex_by()`/`reindex()`. Class-agnostic means
that the object that is grouped/indexed continues to behave as before
except in *collapse* operations utilizing the ‘groups’/‘index_df’
attributes.

The grouped data frame is implemented as follows:
`fgroup_by()` saves the class of the input data, calls
`GRP()` on the columns being grouped, and attaches the
resulting ‘GRP’ object in a `"groups"` attribute. It then
assigns a class attribute as follows:

```
clx <- class(.X) # .X is the data frame being grouped, clx is its class
m <- match(c("GRP_df", "grouped_df", "data.frame"), clx, nomatch = 0L)
class(.X) <- c("GRP_df", if(length(mp <- m[m != 0L])) clx[-mp] else clx, "grouped_df", if(m[3L]) "data.frame")
```

In words: a class `"GRP_df"` is added in front, followed
by the classes of the original object^{4}, followed by `"grouped_df"` and
finally `"data.frame"`, if present. The first class
`"GRP_df"` is for dealing appropriately with the object
through methods for `print()` and subsetting (`[`, `[[`),
e.g. `print.GRP_df` fetches the grouping
object, prints `fungroup(.X)`^{5}, and then prints a
summary of the grouping. `[.GRP_df` works similarly: it saves
the groups, calls `[` on `fungroup(.X)`, and
attaches the groups again if the result is a list with the same number
of rows. So *collapse* has no issues printing and handling
grouped *data.table*’s, *tibbles*, *sf* data
frames, etc.: they continue to behave as usual. Now *collapse*
has various functions with a `.grouped_df` method to deal
with grouped data frames. For example `fmean.grouped_df`, in
a nutshell, fetches the attached ‘GRP’ object using
`GRP.grouped_df`, and calls `fmean.data.frame` on
`fungroup(data)`, passing the ‘GRP’ object to the
`g` argument for grouped computation. Here the general
principles outlined above apply so that the resulting object has the
same classes and attributes as the original one.
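
The class layout described above can be inspected on a plain data frame and on a tibble-classed frame. The latter is created with `qTBL()`, so that *tibble* does not need to be attached. This is a sketch; the object names are chosen here for illustration.

```r
library(collapse)

g_df  <- fgroup_by(mtcars, cyl)
g_tbl <- fgroup_by(qTBL(mtcars), cyl)

class(g_df)                  # "GRP_df" first, then "grouped_df", "data.frame"
class(g_tbl)                 # original tibble classes sit between the two
class(attr(g_df, "groups"))  # the attached grouping object
class(fungroup(g_tbl))       # ungrouping restores the original classes

res <- fmean(g_df)           # grouped means, one row per value of cyl
```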

This architecture has an additional advantage: it allows
`GRP.grouped_df` to examine the grouping object and check if
it was created by *collapse* (class ‘GRP’) or by *dplyr*.
If the latter is the case, an efficient C routine is called to convert
the *dplyr* grouping object to a ‘GRP’ object so that all
`.grouped_df` methods in *collapse* apply to data
frames created with either `dplyr::group_by()` or `fgroup_by()`.

The *indexed_frame* works more or less in the same way. It
inherits from *pdata.frame* so that `.pdata.frame`
methods in *collapse* deal with both *indexed_frame*’s of
arbitrary classes and *pdata.frame*’s created with
*plm*.

A notable difference to both *grouped_df* and
*pdata.frame* is that *indexed_frame* is a deeply indexed
data structure: each variable inside an *indexed_frame* is an
*indexed_series* which contains in its *index_df*
attribute an external pointer to the *index_df* attribute of the
frame. Functions with *pseries* methods operating on
*indexed_series* stored inside the frame (such as
`with(data, flag(column))`) can fetch the index from this
pointer. This allows worry-free application inside arbitrary data
masking environments (`with`, `%$%`, `attach`, etc.) and estimation
commands (`glm`, `feols`, `lmrob`, etc.) without duplication of the
index in memory. As you may have guessed, *indexed_series* are
also class-agnostic and inherit from *pseries*. Any vector or
matrix of any class can become an *indexed_series*.

Further levels of generality are that indexed series and frames allow
one, two or more variables in the index to support both time series and
complex panels, natively deal with irregularity in time^{6}, and provide a rich
set of methods for subsetting and manipulation which also subset the
*index_df* attribute, including internal methods for
`fsubset()`, `funique()`, `roworder(v)` and `na_omit()`. So
*indexed_frame* and *indexed_series* are rich and general structures
permitting fully time-aware computations on nearly any R object. See
`?indexing` for more information.
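
A brief sketch of deep indexing, using the `wlddev` dataset shipped with *collapse*: the lag computed inside `with()` should match a lag with explicitly supplied panel and time identifiers. The object names are chosen here for illustration.

```r
library(collapse)

# Index the panel by country code and year
wldi <- findex_by(wlddev, iso3c, year)

class(wldi)        # includes "indexed_frame" and "pdata.frame"
class(wldi$PCGDP)  # includes "indexed_series" and "pseries"

# The column finds its index by itself inside a data masking environment
lag_inside <- with(wldi, flag(PCGDP))

# Equivalent computation with explicit identifiers on the plain data
lag_explicit <- flag(wlddev$PCGDP, 1, g = wlddev$iso3c, t = wlddev$year)

all.equal(unattrib(lag_inside), unattrib(lag_explicit))
```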

## Conclusion

*collapse* handles R objects in a preserving and fairly
intelligent manner, allowing seamless compatibility with many common
data classes in R, and statistical workflows that preserve attributes
(labels, units, etc.) of the data. This is done by general principles
and some specific considerations/exemptions mostly implemented in C - as
detailed in this vignette.

The main benefits of this design are generality and execution speed:
*collapse* has far fewer R-level method dispatches and function
calls than other frameworks used to perform statistical or data
manipulation operations, it behaves predictably, and may also work well
with your simple new class.

The main disadvantage is that many of the general principles and
exemptions are hard-coded in C and thus may not work with specific
classes. A prominent example where *collapse* simply fails is
*lubridate*’s *interval* class (#186, #418), which
has a `"starts"` attribute of the same length as the data
that is preserved but not subset in *collapse* operations.