Data Transformations

collapse provides an ensemble of functions that perform common data transformations efficiently and in a user-friendly way:

dapply applies functions to rows or columns of matrices and data frames, preserving the data format.
BY is an S3 generic for efficient Split-Apply-Combine computing, similar to dapply.
A set of arithmetic operators facilitates row-wise %rr%, %r+%, %r-%, %r*%, %r/% and column-wise %cr%, %c+%, %c-%, %c*%, %c/% replacing and sweeping operations involving a vector and a matrix or data frame / list. Since v1.7, the operators %+=%, %-=%, %*=% and %/=% perform column-wise and element-wise arithmetic by reference, and the function setop can additionally sweep out rows by reference.
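The three tools above can be sketched on toy data as follows. This is a minimal illustration, not taken from the package documentation: the objects `m`, `x`, `g`, `v` and `w` are made up for the example, and only the calls `dapply`, `BY`, `%r-%` and `%c*%` come from collapse.

```r
library(collapse)

m <- matrix(as.numeric(1:6), nrow = 2)    # rows: (1, 3, 5) and (2, 4, 6)

# dapply works on columns by default (MARGIN = 2); MARGIN = 1 applies FUN to rows
col_sums <- dapply(m, sum)                # 3, 7, 11
row_sums <- dapply(m, sum, MARGIN = 1)    # 9, 12
doubled  <- dapply(m, function(x) x * 2)  # still a 2 x 3 matrix

# BY splits x by g, applies FUN to each group and recombines the results
x <- c(1, 2, 3, 4)
g <- c("a", "a", "b", "b")
group_sums <- BY(x, g, sum)               # a = 3, b = 7

# Row-wise sweep: subtract v from every row (v has one element per column)
v <- c(1, 3, 5)
swept_rows <- m %r-% v                    # rows become (0, 0, 0) and (1, 1, 1)

# Column-wise sweep: multiply every column by w (w has one element per row)
w <- c(10, 100)
scaled_cols <- m %c*% w                   # rows become (10, 30, 50) and (200, 400, 600)
```

The by-reference operators are left out of this sketch on purpose, since they modify their left-hand argument in place and deserve a dedicated example.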

(set)TRA is a more advanced S3 generic to efficiently perform (groupwise) replacing and sweeping out of statistics, either by creating a copy of the data or by reference. Supported operations are:

Integer-id	String-id	Description
0	"na" or "replace_na"	replace only missing values
1	"fill" or "replace_fill"	replace everything
2	"replace"	replace data but preserve missing values
3	"-"	subtract
4	"-+"	subtract group-statistics but add group-frequency weighted average of group statistics
5	"/"	divide
6	"%"	compute percentages
7	"+"	add
8	"*"	multiply
9	"%%"	modulus
10	"-%%"	subtract modulus
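A few of the operations in the table can be tried out on a small grouped vector. The sketch below assumes toy data (`x`, `g`) invented for illustration; the statistics are computed with `fmean`, which skips missing values by default.

```r
library(collapse)

x <- c(1, 2, 3, 4, NA, 6)
g <- c(1, 1, 1, 2, 2, 2)

stats <- fmean(x, g)                          # group means ignoring the NA: 2 and 5

centered <- TRA(x, stats, "-", g)             # -1, 0, 1, -1, NA, 1
by_id    <- TRA(x, stats, 3L, g)              # integer id 3 selects the same operation as "-"
filled   <- TRA(x, stats, "replace_fill", g)  # 2, 2, 2, 5, 5, 5
replaced <- TRA(x, stats, "replace", g)       # 2, 2, 2, 5, NA, 5
```

Note how "replace_fill" overwrites the missing value with the group mean, while "replace" keeps it missing.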

All of collapse's Fast Statistical Functions have a built-in TRA argument for faster access (i.e. you can compute (groupwise) statistics and use them to transform your data with a single function call).
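As a minimal sketch of what the single-call form looks like, the following compares an explicit two-step transformation with the TRA argument of `fmean` and `fsum`. The data are made up for illustration.

```r
library(collapse)

x <- c(1, 2, 3, 4, 5, 6)
g <- c(1, 1, 1, 2, 2, 2)

# Two steps: compute group means, then sweep them out
two_step <- TRA(x, fmean(x, g), "-", g)

# One step: let fmean compute and subtract the group means itself
one_step <- fmean(x, g, TRA = "-")            # -1, 0, 1, -1, 0, 1

# Each value as a percentage of its group total
shares <- fsum(x, g, TRA = "%")
```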

fscale/STD is an S3 generic to perform (groupwise and / or weighted) scaling / standardizing of data and is orders of magnitude faster than scale.
fwithin/W is an S3 generic to efficiently perform (groupwise and / or weighted) within-transformations / demeaning / centering of data. Similarly fbetween/B computes (groupwise and / or weighted) between-transformations / averages (also a lot faster than ave).
fhdwithin/HDW, shorthand for 'higher-dimensional within transform', is an S3 generic to efficiently center data on multiple groups and partial out linear models (possibly involving many levels of fixed effects and interactions). In other words, fhdwithin/HDW efficiently computes residuals from linear models. Similarly fhdbetween/HDB, shorthand for 'higher-dimensional between transformation', computes the corresponding means or fitted values.
flag/L/F, fdiff/D/Dlog and fgrowth/G are S3 generics to compute sequences of lags / leads and suitably lagged and iterated (quasi-, log-) differences and growth rates on time series and panel data. fcumsum flexibly computes (grouped, ordered) cumulative sums. More in Time Series and Panel Series.
STD, W, B, HDW, HDB, L, D, Dlog and G are parsimonious wrappers around the f-functions above, representing the corresponding transformation 'operators'. They have additional capabilities when applied to data frames (i.e. variable selection, formula input, auto-renaming and id-variable preservation), and are easier to employ in regression formulas, but are otherwise identical in functionality.
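The functions and operators listed above can be sketched on toy data as follows. The vectors `x` and `g` are invented for the example, the built-in `mtcars` dataset stands in for real panel data, and the observations are assumed to be already ordered within groups (no time variable is passed). The higher-dimensional functions are omitted from this sketch.

```r
library(collapse)

x <- c(1, 2, 3, 4, 6, 8)
g <- c(1, 1, 1, 2, 2, 2)

# Groupwise centering, averaging and standardizing
centered <- fwithin(x, g)          # -1, 0, 1, -2, 0, 2
averaged <- fbetween(x, g)         # 2, 2, 2, 6, 6, 6
scaled   <- fscale(x, g)           # -1, 0, 1, -1, 0, 1

# Groupwise lags, first differences, growth rates (in percent) and cumulative sums
lagged  <- flag(x, 1, g)           # NA, 1, 2, NA, 4, 6
diffed  <- fdiff(x, 1, 1, g)       # NA, 1, 1, NA, 2, 2
growth  <- fgrowth(x, 1, 1, g)     # NA, 100, 50, NA, 50, 33.33...
running <- fcumsum(x, g)           # 1, 3, 6, 4, 10, 18

# Operator with formula input on a data frame: demean mpg and hp by cyl,
# keeping the id variable and prefixing the transformed columns
w_df <- W(mtcars, mpg + hp ~ cyl)

# Operators inside a regression formula: regressing demeaned mpg on demeaned hp
# gives the same slope as including cyl as a factor
fe_fit  <- lm(W(mpg, cyl) ~ W(hp, cyl), data = mtcars)
dum_fit <- lm(mpg ~ hp + factor(cyl), data = mtcars)
```

The last two fits illustrate why the operators are convenient in formulas: the within-transformation is written inline instead of being precomputed as extra columns.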

Table of Functions

Function / S3 Generic	Methods	Description
`dapply`	No methods, works with matrices and data frames	Apply functions to rows or columns
`BY`	`default, matrix, data.frame, grouped_df`	Split-Apply-Combine computing
`%(r/c)(r/+/-/*//)%`	No methods, works with matrices and data frames / lists	Row- and column-arithmetic
`(set)TRA`	`default, matrix, data.frame, grouped_df`	Replace and sweep out statistics (by reference)
`fscale/STD`	`default, matrix, data.frame, pseries, pdata.frame, grouped_df`	Scale / standardize data
`fwithin/W`	`default, matrix, data.frame, pseries, pdata.frame, grouped_df`	Demean / center data
`fbetween/B`	`default, matrix, data.frame, pseries, pdata.frame, grouped_df`	Compute means / average data
`fhdwithin/HDW`	`default, matrix, data.frame, pseries, pdata.frame`	High-dimensional centering and lm residuals
`fhdbetween/HDB`	`default, matrix, data.frame, pseries, pdata.frame`	High-dimensional averages and lm fitted values
`flag/L/F`, `fdiff/D/Dlog`, `fgrowth/G`, `fcumsum`	`default, matrix, data.frame, pseries, pdata.frame, grouped_df`	(Sequences of) lags / leads, differences, growth rates and cumulative sums

See also