Data Transformations
data-transformations.Rd
collapse provides an ensemble of functions that perform common data transformations efficiently and in a user-friendly way:
dapply applies functions to rows or columns of matrices and data frames, preserving the data format.
BY is an S3 generic for efficient Split-Apply-Combine computing, similar to dapply.
A set of arithmetic operators facilitates row-wise %rr%, %r+%, %r-%, %r*%, %r/% and column-wise %cr%, %c+%, %c-%, %c*%, %c/% replacing and sweeping operations involving a vector and a matrix or data frame / list. Since v1.7, the operators %+=%, %-=%, %*=% and %/=% do column- and element-wise math by reference, and the function setop can also perform sweeping out rows by reference.
(set)TRA is a more advanced S3 generic to efficiently perform (groupwise) replacing and sweeping out of statistics, either by creating a copy of the data or by reference. Supported operations are:

Integer-id | String-id | Description
0 | "na" or "replace_na" | replace only missing values
1 | "fill" or "replace_fill" | replace everything
2 | "replace" | replace data but preserve missing values
3 | "-" | subtract
4 | "-+" | subtract group-statistics but add group-frequency weighted average of group statistics
5 | "/" | divide
6 | "%" | compute percentages
7 | "+" | add
8 | "*" | multiply
9 | "%%" | modulus
10 | "-%%" | subtract modulus

All of collapse's Fast Statistical Functions have a built-in TRA argument for faster access (i.e. you can compute (groupwise) statistics and use them to transform your data with a single function call).
fscale/STD is an S3 generic to perform (groupwise and / or weighted) scaling / standardizing of data and is orders of magnitude faster than scale.
fwithin/W is an S3 generic to efficiently perform (groupwise and / or weighted) within-transformations / demeaning / centering of data. Similarly fbetween/B computes (groupwise and / or weighted) between-transformations / averages (also a lot faster than ave).
fhdwithin/HDW, shorthand for 'higher-dimensional within transform', is an S3 generic to efficiently center data on multiple groups and partial-out linear models (possibly involving many levels of fixed effects and interactions). In other words, fhdwithin/HDW efficiently computes residuals from linear models. Similarly fhdbetween/HDB, shorthand for 'higher-dimensional between transformation', computes the corresponding means or fitted values.
flag/L/F, fdiff/D/Dlog and fgrowth/G are S3 generics to compute sequences of lags / leads and suitably lagged and iterated (quasi-, log-) differences and growth rates on time series and panel data. fcumsum flexibly computes (grouped, ordered) cumulative sums. More in Time Series and Panel Series.
STD, W, B, HDW, HDB, L, D, Dlog and G are parsimonious wrappers around the f- functions above representing the corresponding transformation 'operators'. They have additional capabilities when applied to data-frames (i.e. variable selection, formula input, auto-renaming and id-variable preservation), and are easier to employ in regression formulas, but are otherwise identical in functionality.
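The relationship between TRA, the TRA argument of the Fast Statistical Functions, and the dedicated transformation functions can be illustrated with a minimal sketch. It assumes the collapse package is installed and uses the built-in iris and AirPassengers datasets purely for illustration; the object names (c1, b, w_df, ...) are hypothetical and not part of the package.

```r
# Sketch only: assumes collapse is installed; iris and AirPassengers ship with R.
library(collapse)

x <- iris$Sepal.Length
g <- iris$Species

# Group-wise centering via three routes that should agree
c1 <- TRA(x, fmean(x, g), "-", g)   # sweep out precomputed group means
c2 <- fmean(x, g, TRA = "-")        # built-in TRA argument, single call
c3 <- fwithin(x, g)                 # dedicated within-transformation

# Group means expanded to full length, and group-wise standardizing
b <- fbetween(x, g)
z <- fscale(x, g)

# Lag, first difference and growth rate (in percent) of a plain series
y  <- as.numeric(AirPassengers)
l1 <- flag(y)
d1 <- fdiff(y)
g1 <- fgrowth(y)

# Operator form on a data frame: formula input selects the grouping variable
w_df <- W(iris, ~ Species)
```

With the formula interface, W centers the numeric columns within Species and keeps the grouping column, which is the data-frame convenience the operators add over the plain f- functions.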
Table of Functions
Function / S3 Generic | Methods | Description
dapply | No methods, works with matrices and data frames | Apply functions to rows or columns
BY | default, matrix, data.frame, grouped_df | Split-Apply-Combine computing
%(r/c)(r/+/-/*//)% | No methods, works with matrices and data frames / lists | Row- and column-arithmetic
(set)TRA | default, matrix, data.frame, grouped_df | Replace and sweep out statistics (by reference)
fscale/STD | default, matrix, data.frame, pseries, pdata.frame, grouped_df | Scale / standardize data
fwithin/W | default, matrix, data.frame, pseries, pdata.frame, grouped_df | Demean / center data
fbetween/B | default, matrix, data.frame, pseries, pdata.frame, grouped_df | Compute means / average data
fhdwithin/HDW | default, matrix, data.frame, pseries, pdata.frame | High-dimensional centering and lm residuals
fhdbetween/HDB | default, matrix, data.frame, pseries, pdata.frame | High-dimensional averages and lm fitted values
flag/L/F, fdiff/D/Dlog, fgrowth/G, fcumsum | default, matrix, data.frame, pseries, pdata.frame, grouped_df | (Sequences of) lags / leads, differences, growth rates and cumulative sums
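The first three rows of the table (dapply, BY and the row- / column-arithmetic operators) can be tried out with a short sketch like the one below. It assumes collapse is installed and uses a few columns of the built-in mtcars data for illustration only; all object names are hypothetical.

```r
# Sketch only: assumes collapse is installed; mtcars ships with R.
library(collapse)

m <- as.matrix(mtcars[, c("mpg", "hp", "wt")])

# dapply: apply a function to columns (default) or to rows (MARGIN = 1)
col_centered <- dapply(m, function(v) v - mean(v))
row_sums     <- dapply(m, sum, MARGIN = 1)

# BY: split-apply-combine on a vector, grouped by number of cylinders
mpg_by_cyl <- BY(mtcars$mpg, mtcars$cyl, mean)

# Row-wise sweep: the vector has one value per column and is
# subtracted from every row
swept_rows <- m %r-% colMeans(m)

# Column-wise sweep: the vector has one value per row and divides
# every column
scaled_cols <- m %c/% rowSums(m)
```

In base R terms, the row-wise operator here plays the role of sweep(m, 2, colMeans(m)), and the column-wise one that of m / rowSums(m).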