Skip to contents

collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are:

  • To facilitate complex data transformation, exploration and computing tasks in R.

  • To help make R code fast, flexible, parsimonious and programmer friendly.

It is made compatible with the tidyverse, data.table, sf, units, xts/zoo, and the plm approach to panel data.

Getting Started

Read the short vignette on documentation resources, and check out the built in documentation.

Details

collapse provides an integrated suite of statistical and data manipulation functions that greatly extend and enhance the capabilities of base R. In a nutshell, collapse provides:

  • Fast C/C++ based (grouped, weighted) computations embedded in highly optimized R code.

  • More complex statistical, time series / panel data and recursive (list-processing) operations.

  • A flexible and generic approach supporting and preserving many R objects.

  • Optimized programming in standard and non-standard evaluation.

The statistical functions in collapse are S3 generic with core methods for vectors, matrices and data frames, and internally support grouped and weighted computations carried out in C/C++.

Additional methods and C-level features enable broad based compatibility with dplyr (grouped tibble), data.table, sf and plm panel data classes. Functions and core methods seek to preserve object attributes (including column attributes such as variable labels), ensuring flexibility and effective workflows with a very broad range of R objects (including most time-series classes). See also the vignette on collapse's handling of R objects.

Missing values are efficiently skipped at C/C++ level. The package default is na.rm = TRUE. This can be changed using set_collapse(na.rm = FALSE). Missing weights are generally supported.

collapse installs with a built-in hierarchical documentation facilitating the use of the package.

The package is coded both in C and C++ and built with Rcpp, but also uses C/C++ functions from data.table, kit, fixest, weights, stats and RcppArmadillo / RcppEigen.

Author(s)

Maintainer: Sebastian Krantz sebastian.krantz@graduateinstitute.ch

Other contributors from packages collapse utilizes:

  • Matt Dowle, Arun Srinivasan and contributors worldwide (data.table)

  • Dirk Eddelbuettel and contributors worldwide (Rcpp, RcppArmadillo, RcppEigen)

  • Morgan Jacob (kit)

  • Laurent Berge (fixest)

  • Josh Pasek (weights)

  • R Core Team and contributors worldwide (stats)

I thank many people from diverse fields for helpful answers on Stackoverflow, Joris Meys for encouraging me and helping to set up the GitHub repository for collapse, and many other people for feature requests and helpful suggestions.

Developing / Bug Reporting

Examples

## Note: this set of examples is is certainly non-exhaustive and does not
## showcase many recent features, but remains a very good starting point

## Let's start with some statistical programming
v <- iris$Sepal.Length
d <- num_vars(iris)    # Saving numeric variables
f <- iris$Species      # Factor

# Simple statistics
fmean(v)               # vector
#> [1] 5.843333
fmean(qM(d))           # matrix (qM is a faster as.matrix)
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#>     5.843333     3.057333     3.758000     1.199333 
fmean(d)               # data.frame
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#>     5.843333     3.057333     3.758000     1.199333 

# Preserving data structure
fmean(qM(d), drop = FALSE)     # Still a matrix
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]     5.843333    3.057333        3.758    1.199333
fmean(d, drop = FALSE)         # Still a data.frame
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.843333    3.057333        3.758    1.199333

# Weighted statistics, supported by most functions...
w <- abs(rnorm(fnrow(iris)))
fmean(d, w = w)
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#>     5.839603     3.086031     3.721517     1.199056 

# Grouped statistics...
fmean(d, f)
#>            Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa            5.006       3.428        1.462       0.246
#> versicolor        5.936       2.770        4.260       1.326
#> virginica         6.588       2.974        5.552       2.026

# Groupwise-weighted statistics...
fmean(d, f, w)
#>            Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa         5.035771    3.447914     1.453903   0.2653375
#> versicolor     5.930956    2.795360     4.255436   1.3213342
#> virginica      6.578510    2.990917     5.543057   2.0419757

# Simple Transformations...
head(fmode(d, TRA = "replace"))    # Replacing values with the mode
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1            5           3          1.5         0.2
#> 2            5           3          1.5         0.2
#> 3            5           3          1.5         0.2
#> 4            5           3          1.5         0.2
#> 5            5           3          1.5         0.2
#> 6            5           3          1.5         0.2
head(fmedian(d, TRA = "-"))        # Subtracting the median
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1         -0.7         0.5        -2.95        -1.1
#> 2         -0.9         0.0        -2.95        -1.1
#> 3         -1.1         0.2        -3.05        -1.1
#> 4         -1.2         0.1        -2.85        -1.1
#> 5         -0.8         0.6        -2.95        -1.1
#> 6         -0.4         0.9        -2.65        -0.9
head(fsum(d, TRA = "%"))           # Computing percentages
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1    0.5818597   0.7631923    0.2483591   0.1111729
#> 2    0.5590416   0.6541648    0.2483591   0.1111729
#> 3    0.5362236   0.6977758    0.2306191   0.1111729
#> 4    0.5248146   0.6759703    0.2660990   0.1111729
#> 5    0.5704507   0.7849978    0.2483591   0.1111729
#> 6    0.6160867   0.8504143    0.3015789   0.2223457
head(fsd(d, TRA = "/"))            # Dividing by the standard-deviation (scaling), etc...
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     6.158928    8.029986    0.7930671   0.2623854
#> 2     5.917402    6.882845    0.7930671   0.2623854
#> 3     5.675875    7.341701    0.7364195   0.2623854
#> 4     5.555112    7.112273    0.8497148   0.2623854
#> 5     6.038165    8.259414    0.7930671   0.2623854
#> 6     6.521218    8.947698    0.9630101   0.5247707

# Weighted Transformations...
head(fnth(d, 0.75, w = w, TRA = "replace"))  # Replacing by the weighted 3rd quartile
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          6.4         3.4          5.1         1.8
#> 2          6.4         3.4          5.1         1.8
#> 3          6.4         3.4          5.1         1.8
#> 4          6.4         3.4          5.1         1.8
#> 5          6.4         3.4          5.1         1.8
#> 6          6.4         3.4          5.1         1.8

# Grouped Transformations...
head(fvar(d, f, TRA = "replace"))  # Replacing values with the group variance
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     0.124249   0.1436898   0.03015918  0.01110612
#> 2     0.124249   0.1436898   0.03015918  0.01110612
#> 3     0.124249   0.1436898   0.03015918  0.01110612
#> 4     0.124249   0.1436898   0.03015918  0.01110612
#> 5     0.124249   0.1436898   0.03015918  0.01110612
#> 6     0.124249   0.1436898   0.03015918  0.01110612
head(fsd(d, f, TRA = "/"))         # Grouped scaling
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     14.46851    9.233260     8.061544    1.897793
#> 2     13.90112    7.914223     8.061544    1.897793
#> 3     13.33372    8.441838     7.485720    1.897793
#> 4     13.05003    8.178031     8.637369    1.897793
#> 5     14.18481    9.497068     8.061544    1.897793
#> 6     15.31960   10.288490     9.789018    3.795585
head(fmin(d, f, TRA = "-"))        # Setting the minimum value in each species to 0
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          0.8         1.2          0.4         0.1
#> 2          0.6         0.7          0.4         0.1
#> 3          0.4         0.9          0.3         0.1
#> 4          0.3         0.8          0.5         0.1
#> 5          0.7         1.3          0.4         0.1
#> 6          1.1         1.6          0.7         0.3
head(fsum(d, f, TRA = "/"))        # Dividing by the sum (proportions)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1   0.02037555  0.02042007   0.01915185  0.01626016
#> 2   0.01957651  0.01750292   0.01915185  0.01626016
#> 3   0.01877747  0.01866978   0.01778386  0.01626016
#> 4   0.01837795  0.01808635   0.02051984  0.01626016
#> 5   0.01997603  0.02100350   0.01915185  0.01626016
#> 6   0.02157411  0.02275379   0.02325581  0.03252033
head(fmedian(d, f, TRA = "-"))     # Groupwise de-median
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          0.1         0.1         -0.1         0.0
#> 2         -0.1        -0.4         -0.1         0.0
#> 3         -0.3        -0.2         -0.2         0.0
#> 4         -0.4        -0.3          0.0         0.0
#> 5          0.0         0.2         -0.1         0.0
#> 6          0.4         0.5          0.2         0.2
head(ffirst(d, f, TRA = "%%"))     # Taking modulus of first group-value, etc. ...
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          0.0         0.0          0.0           0
#> 2          4.9         3.0          0.0           0
#> 3          4.7         3.2          1.3           0
#> 4          4.6         3.1          0.1           0
#> 5          5.0         0.1          0.0           0
#> 6          0.3         0.4          0.3           0

# Grouped and weighted transformations...
head(fsd(d, f, w, "/"), 3)         # weighted scaling
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     14.24438    9.301969     8.285370    1.776243
#> 2     13.68577    7.973117     8.285370    1.776243
#> 3     13.12717    8.504658     7.693558    1.776243
head(fmedian(d, f, w, "-"), 3)     # subtracting the weighted group-median
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          0.1         0.1         -0.1           0
#> 2         -0.1        -0.4         -0.1           0
#> 3         -0.3        -0.2         -0.2           0
head(fmode(d, f, w, "replace"), 3) # replace with weighted statistical mode
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1            5         3.4          1.4         0.2
#> 2            5         3.4          1.4         0.2
#> 3            5         3.4          1.4         0.2

## Some more advanced transformations...
head(fbetween(d))                             # Averaging (faster t.: fmean(d, TRA = "replace"))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.843333    3.057333        3.758    1.199333
#> 2     5.843333    3.057333        3.758    1.199333
#> 3     5.843333    3.057333        3.758    1.199333
#> 4     5.843333    3.057333        3.758    1.199333
#> 5     5.843333    3.057333        3.758    1.199333
#> 6     5.843333    3.057333        3.758    1.199333
head(fwithin(d))                              # Centering (faster than: fmean(d, TRA = "-"))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1   -0.7433333  0.44266667       -2.358  -0.9993333
#> 2   -0.9433333 -0.05733333       -2.358  -0.9993333
#> 3   -1.1433333  0.14266667       -2.458  -0.9993333
#> 4   -1.2433333  0.04266667       -2.258  -0.9993333
#> 5   -0.8433333  0.54266667       -2.358  -0.9993333
#> 6   -0.4433333  0.84266667       -2.058  -0.7993333
head(fwithin(d, f, w))                        # Grouped and weighted (same as fmean(d, f, w, "-"))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1   0.06422856  0.05208646  -0.05390312 -0.06533746
#> 2  -0.13577144 -0.44791354  -0.05390312 -0.06533746
#> 3  -0.33577144 -0.24791354  -0.15390312 -0.06533746
#> 4  -0.43577144 -0.34791354   0.04609688 -0.06533746
#> 5  -0.03577144  0.15208646  -0.05390312 -0.06533746
#> 6   0.36422856  0.45208646   0.24609688  0.13466254
head(fwithin(d, f, w, mean = 5))              # Setting a custom mean
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.064229    5.052086     4.946097    4.934663
#> 2     4.864229    4.552086     4.946097    4.934663
#> 3     4.664229    4.752086     4.846097    4.934663
#> 4     4.564229    4.652086     5.046097    4.934663
#> 5     4.964229    5.152086     4.946097    4.934663
#> 6     5.364229    5.452086     5.246097    5.134663
head(fwithin(d, f, w, theta = 0.76))          # Quasi-centering i.e. d - theta*fbetween(d, f, w)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1    1.2728137   0.8795857    0.2950336 -0.00165647
#> 2    1.0728137   0.3795857    0.2950336 -0.00165647
#> 3    0.8728137   0.5795857    0.1950336 -0.00165647
#> 4    0.7728137   0.4795857    0.3950336 -0.00165647
#> 5    1.1728137   0.9795857    0.2950336 -0.00165647
#> 6    1.5728137   1.2795857    0.5950336  0.19834353
head(fwithin(d, f, w, mean = "overall.mean")) # Preserving the overall mean of the data
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.903831    3.138117     3.667614    1.133719
#> 2     5.703831    2.638117     3.667614    1.133719
#> 3     5.503831    2.838117     3.567614    1.133719
#> 4     5.403831    2.738117     3.767614    1.133719
#> 5     5.803831    3.238117     3.667614    1.133719
#> 6     6.203831    3.538117     3.967614    1.333719
head(fscale(d))                               # Scaling and centering
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1   -0.8976739  1.01560199    -1.335752   -1.311052
#> 2   -1.1392005 -0.13153881    -1.335752   -1.311052
#> 3   -1.3807271  0.32731751    -1.392399   -1.311052
#> 4   -1.5014904  0.09788935    -1.279104   -1.311052
#> 5   -1.0184372  1.24503015    -1.335752   -1.311052
#> 6   -0.5353840  1.93331463    -1.165809   -1.048667
head(fscale(d, mean = 5, sd = 3))             # Custom scaling and centering
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1    2.3069784    8.046806    0.9927451    1.066844
#> 2    1.5823985    4.605384    0.9927451    1.066844
#> 3    0.8578187    5.981953    0.8228021    1.066844
#> 4    0.4955288    5.293668    1.1626881    1.066844
#> 5    1.9446885    8.735090    0.9927451    1.066844
#> 6    3.3938481   10.799944    1.5025740    1.854000
head(fscale(d, mean = FALSE, sd = 3))         # Mean preserving scaling
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     3.150312    6.104139  -0.24925490   -2.733823
#> 2     2.425732    2.662717  -0.24925490   -2.733823
#> 3     1.701152    4.039286  -0.41919786   -2.733823
#> 4     1.338862    3.351001  -0.07931195   -2.733823
#> 5     2.788022    6.792424  -0.24925490   -2.733823
#> 6     4.237181    8.857277   0.26057397   -1.946667
head(fscale(d, f, w))                         # Grouped and weighted scaling and centering
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1   0.17939134   0.1384305   -0.3190052   -0.580276
#> 2  -0.37921170  -1.1904223   -0.3190052   -0.580276
#> 3  -0.93781474  -0.6588812   -0.9108174   -0.580276
#> 4  -1.21711626  -0.9246517    0.2728070   -0.580276
#> 5  -0.09991018   0.4042010   -0.3190052   -0.580276
#> 6   1.01729590   1.2015127    1.4564313    1.195967
head(fscale(d, f, w, mean = 5, sd = 3))       # Custom grouped and weighted scaling and centering
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.538174    5.415291     4.042984    3.259172
#> 2     3.862365    1.428733     4.042984    3.259172
#> 3     2.186556    3.023356     2.267548    3.259172
#> 4     1.348651    2.226045     5.818421    3.259172
#> 5     4.700269    6.212603     4.042984    3.259172
#> 6     8.051888    8.604538     9.369294    8.587901
head(fscale(d, f, w, mean = FALSE,            # Preserving group means
            sd = "within.sd"))                # and setting group-sd to fsd(fwithin(d, f, w), w = w)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.124988    3.495163     1.323723   0.1488517
#> 2     4.847178    3.041600     1.323723   0.1488517
#> 3     4.569368    3.223025     1.082215   0.1488517
#> 4     4.430462    3.132312     1.565231   0.1488517
#> 5     4.986083    3.585875     1.323723   0.1488517
#> 6     5.541704    3.858013     2.048247   0.5054182
head(fscale(d, f, w, mean = "overall.mean",   # Full harmonization of group means and variances,
            sd = "within.sd"))                # while preserving the level and scale of the data.
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.928819    3.133280     3.591337    1.082571
#> 2     5.651009    2.679717     3.591337    1.082571
#> 3     5.373199    2.861142     3.349829    1.082571
#> 4     5.234294    2.770429     3.832845    1.082571
#> 5     5.789914    3.223992     3.591337    1.082571
#> 6     6.345535    3.496130     4.315861    1.439137

head(get_vars(iris, 1:2))                      # Use get_vars for fast selecting, gv is shortcut
#>   Sepal.Length Sepal.Width
#> 1          5.1         3.5
#> 2          4.9         3.0
#> 3          4.7         3.2
#> 4          4.6         3.1
#> 5          5.0         3.6
#> 6          5.4         3.9
head(fhdbetween(gv(iris, 1:2), gv(iris, 3:5))) # Linear prediction with factors and covariates
#>   Sepal.Length Sepal.Width
#> 1     4.950107    3.389732
#> 2     4.950107    3.389732
#> 3     4.859513    3.374264
#> 4     5.040702    3.405199
#> 5     4.950107    3.389732
#> 6     5.220692    3.560823
head(fhdwithin(gv(iris, 1:2), gv(iris, 3:5)))  # Linear partialling out factors and covariates
#>   Sepal.Length Sepal.Width
#> 1   0.14989286   0.1102684
#> 2  -0.05010714  -0.3897316
#> 3  -0.15951256  -0.1742640
#> 4  -0.44070173  -0.3051992
#> 5   0.04989286   0.2102684
#> 6   0.17930818   0.3391766
ss(iris, 1:10, 1:2)                            # Similarly fsubset/ss for fast subsetting rows
#>    Sepal.Length Sepal.Width
#> 1           5.1         3.5
#> 2           4.9         3.0
#> 3           4.7         3.2
#> 4           4.6         3.1
#> 5           5.0         3.6
#> 6           5.4         3.9
#> 7           4.6         3.4
#> 8           5.0         3.4
#> 9           4.4         2.9
#> 10          4.9         3.1

# Simple Time-Computations..
head(flag(AirPassengers, -1:3))                # One lead and three lags
#>       F1  --  L1  L2  L3
#> [1,] 118 112  NA  NA  NA
#> [2,] 132 118 112  NA  NA
#> [3,] 129 132 118 112  NA
#> [4,] 121 129 132 118 112
#> [5,] 135 121 129 132 118
#> [6,] 148 135 121 129 132
head(fdiff(EuStockMarkets,                     # Suitably lagged first and second differences
      c(1, frequency(EuStockMarkets)), diff = 1:2))
#>      D1.DAX D2.DAX L260D1.DAX L260D2.DAX D1.SMI D2.SMI L260D1.SMI L260D2.SMI
#> [1,]     NA     NA         NA         NA     NA     NA         NA         NA
#> [2,] -15.12     NA         NA         NA   10.4     NA         NA         NA
#> [3,]  -7.12   8.00         NA         NA   -9.9  -20.3         NA         NA
#> [4,]  14.53  21.65         NA         NA    5.5   15.4         NA         NA
#>      D1.CAC D2.CAC L260D1.CAC L260D2.CAC D1.FTSE D2.FTSE L260D1.FTSE
#> [1,]     NA     NA         NA         NA      NA      NA          NA
#> [2,]  -22.3     NA         NA         NA    16.6      NA          NA
#> [3,]  -32.5  -10.2         NA         NA   -12.0   -28.6          NA
#> [4,]   -9.9   22.6         NA         NA    22.2    34.2          NA
#>      L260D2.FTSE
#> [1,]          NA
#> [2,]          NA
#> [3,]          NA
#> [4,]          NA
#>  [ reached getOption("max.print") -- omitted 2 rows ]
head(fdiff(EuStockMarkets, rho = 0.87))        # Quasi-differences (x_t - rho*x_t-1)
#>           DAX     SMI     CAC    FTSE
#> [1,]       NA      NA      NA      NA
#> [2,] 196.6175 228.553 208.164 334.268
#> [3,] 202.6519 209.605 195.065 307.826
#> [4,] 223.3763 223.718 213.440 340.466
#> [5,] 207.8552 221.433 237.053 335.452
#> [6,] 202.8108 204.258 215.203 305.111
head(fdiff(EuStockMarkets, log = TRUE))        # Log-differences
#>               DAX          SMI          CAC         FTSE
#> [1,]           NA           NA           NA           NA
#> [2,] -0.009326550  0.006178360 -0.012658756  0.006770286
#> [3,] -0.004422175 -0.005880448 -0.018740638 -0.004889587
#> [4,]  0.009003794  0.003271184 -0.005779182  0.009027020
#> [5,] -0.001778217  0.001483372  0.008743353  0.005771847
#> [6,] -0.004676712 -0.008933417 -0.005120160 -0.007230164
head(fgrowth(EuStockMarkets))                  # Exact growth rates (percentage change)
#>             DAX        SMI        CAC       FTSE
#> [1,]         NA         NA         NA         NA
#> [2,] -0.9283193  0.6197485 -1.2578971  0.6793256
#> [3,] -0.4412412 -0.5863192 -1.8566124 -0.4877652
#> [4,]  0.9044450  0.3276540 -0.5762515  0.9067887
#> [5,] -0.1776637  0.1484472  0.8781687  0.5788536
#> [6,] -0.4665793 -0.8893632 -0.5107074 -0.7204089
head(fgrowth(EuStockMarkets, logdiff = TRUE))  # Log-difference growth rates (percentage change)
#>             DAX        SMI        CAC       FTSE
#> [1,]         NA         NA         NA         NA
#> [2,] -0.9326550  0.6178360 -1.2658756  0.6770286
#> [3,] -0.4422175 -0.5880448 -1.8740638 -0.4889587
#> [4,]  0.9003794  0.3271184 -0.5779182  0.9027020
#> [5,] -0.1778217  0.1483372  0.8743353  0.5771847
#> [6,] -0.4676712 -0.8933417 -0.5120160 -0.7230164
# Note that it is not necessary to use factors for grouping.
fmean(gv(mtcars, -c(2,8:9)), mtcars$cyl) # Can also use vector (internally converted using qF())
#>        mpg     disp        hp     drat       wt     qsec     gear     carb
#> 4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727 4.090909 1.545455
#> 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 3.857143 3.428571
#> 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 3.285714 3.500000
fmean(gv(mtcars, -c(2,8:9)),
      gv(mtcars, c(2,8:9)))              # or a list of vector (internally grouped using GRP())
#>            mpg     disp        hp     drat       wt     qsec     gear     carb
#> 4.0.1 26.00000 120.3000  91.00000 4.430000 2.140000 16.70000 5.000000 2.000000
#> 4.1.0 22.90000 135.8667  84.66667 3.770000 2.935000 20.97000 3.666667 1.666667
#> 4.1.1 28.37143  89.8000  80.57143 4.148571 2.028286 18.70000 4.142857 1.428571
#> 6.0.1 20.56667 155.0000 131.66667 3.806667 2.755000 16.32667 4.333333 4.666667
#> 6.1.0 19.12500 204.5500 115.25000 3.420000 3.388750 19.21500 3.500000 2.500000
#> 8.0.0 15.05000 357.6167 194.16667 3.120833 4.104083 17.14250 3.000000 3.083333
#> 8.0.1 15.40000 326.0000 299.50000 3.880000 3.370000 14.55000 5.000000 6.000000
g <- GRP(mtcars, ~ cyl + vs + am)        # It is also possible to create grouping objects
print(g)                                 # These are instructive to learn about the grouping,
#> collapse grouping object of length 32 with 7 ordered groups
#> 
#> Call: GRP.default(X = mtcars, by = ~cyl + vs + am), X is unsorted
#> 
#> Distribution of group sizes: 
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   2.500   3.000   4.571   5.500  12.000 
#> 
#> Groups with sizes: 
#> 4.0.1 4.1.0 4.1.1 6.0.1 6.1.0 8.0.0 8.0.1 
#>     1     3     7     3     4    12     2 
plot(g)                                  # and are directly handed down to C++ code

fmean(gv(mtcars, -c(2,8:9)), g)          # This can speed up multiple computations over same groups
#>            mpg     disp        hp     drat       wt     qsec     gear     carb
#> 4.0.1 26.00000 120.3000  91.00000 4.430000 2.140000 16.70000 5.000000 2.000000
#> 4.1.0 22.90000 135.8667  84.66667 3.770000 2.935000 20.97000 3.666667 1.666667
#> 4.1.1 28.37143  89.8000  80.57143 4.148571 2.028286 18.70000 4.142857 1.428571
#> 6.0.1 20.56667 155.0000 131.66667 3.806667 2.755000 16.32667 4.333333 4.666667
#> 6.1.0 19.12500 204.5500 115.25000 3.420000 3.388750 19.21500 3.500000 2.500000
#> 8.0.0 15.05000 357.6167 194.16667 3.120833 4.104083 17.14250 3.000000 3.083333
#> 8.0.1 15.40000 326.0000 299.50000 3.880000 3.370000 14.55000 5.000000 6.000000
fsd(gv(mtcars, -c(2,8:9)), g)
#>             mpg      disp       hp      drat        wt       qsec      gear
#> 4.0.1        NA        NA       NA        NA        NA         NA        NA
#> 4.1.0 1.4525839 13.969371 19.65536 0.1300000 0.4075230 1.67143651 0.5773503
#> 4.1.1 4.7577005 18.802128 24.14441 0.3783926 0.4400840 0.94546285 0.3779645
#> 6.0.1 0.7505553  8.660254 37.52777 0.1616581 0.1281601 0.76872188 0.5773503
#> 6.1.0 1.6317169 44.742634  9.17878 0.5919459 0.1162164 0.81590441 0.5773503
#> 8.0.0 2.7743959 71.823494 33.35984 0.2302749 0.7683069 0.80164745 0.0000000
#> 8.0.1 0.5656854 35.355339 50.20458 0.4808326 0.2828427 0.07071068 0.0000000
#>            carb
#> 4.0.1        NA
#> 4.1.0 0.5773503
#> 4.1.1 0.5345225
#> 6.0.1 1.1547005
#> 6.1.0 1.7320508
#> 8.0.0 0.9003366
#> 8.0.1 2.8284271

# Factors can efficiently be created using qF()
f1 <- qF(mtcars$cyl)                     # Unlike GRP objects, factors are checked for NA's
f2 <- qF(mtcars$cyl, na.exclude = FALSE) # This can however be avoided through this option
class(f2)                                # Note the added class
#> [1] "factor"      "na.included"

library(microbenchmark)
microbenchmark(fmean(mtcars, f1), fmean(mtcars, f2)) # A minor difference, larger on larger data
#> Unit: microseconds
#>               expr   min    lq    mean median    uq    max neval
#>  fmean(mtcars, f1) 3.895 4.141 4.48745  4.223 4.305 27.962   100
#>  fmean(mtcars, f2) 3.854 4.018 4.14838  4.059 4.141  9.635   100

with(mtcars, finteraction(cyl, vs, am))  # Efficient interactions of vectors and/or factors
#>  [1] 6.0.1 6.0.1 4.1.1 6.1.0 8.0.0 6.1.0 8.0.0 4.1.0 4.1.0 6.1.0 6.1.0 8.0.0
#> [13] 8.0.0 8.0.0 8.0.0 8.0.0 8.0.0 4.1.1 4.1.1 4.1.1 4.1.0 8.0.0 8.0.0 8.0.0
#> [25] 8.0.0 4.1.1 4.0.1 4.1.1 8.0.1 6.0.1 8.0.1 4.1.1
#> Levels: 4.0.1 4.1.0 4.1.1 6.0.1 6.1.0 8.0.0 8.0.1
finteraction(gv(mtcars, c(2,8:9)))       # .. or lists of vectors/factors
#>  [1] 6.0.1 6.0.1 4.1.1 6.1.0 8.0.0 6.1.0 8.0.0 4.1.0 4.1.0 6.1.0 6.1.0 8.0.0
#> [13] 8.0.0 8.0.0 8.0.0 8.0.0 8.0.0 4.1.1 4.1.1 4.1.1 4.1.0 8.0.0 8.0.0 8.0.0
#> [25] 8.0.0 4.1.1 4.0.1 4.1.1 8.0.1 6.0.1 8.0.1 4.1.1
#> Levels: 4.0.1 4.1.0 4.1.1 6.0.1 6.1.0 8.0.0 8.0.1

# Simple row- or column-wise computations on matrices or data frames with dapply()
dapply(mtcars, quantile)                 # column quantiles
#>         mpg cyl    disp    hp  drat      wt    qsec vs am gear carb
#> 0%   10.400   4  71.100  52.0 2.760 1.51300 14.5000  0  0    3    1
#> 25%  15.425   4 120.825  96.5 3.080 2.58125 16.8925  0  0    3    2
#> 50%  19.200   6 196.300 123.0 3.695 3.32500 17.7100  0  0    4    2
#> 75%  22.800   8 326.000 180.0 3.920 3.61000 18.9000  1  1    4    4
#> 100% 33.900   8 472.000 335.0 4.930 5.42400 22.9000  1  1    5    8
dapply(mtcars, quantile, MARGIN = 1)     # Row-quantiles
#>                   0%    25%   50%    75%  100%
#> Mazda RX4          0 3.2600 4.000 18.730 160.0
#> Mazda RX4 Wag      0 3.3875 4.000 19.010 160.0
#> Datsun 710         1 1.6600 4.000 20.705 108.0
#> Hornet 4 Drive     0 2.0000 3.215 20.420 258.0
#> Hornet Sportabout  0 2.5000 3.440 17.860 360.0
#> Valiant            0 1.8800 3.460 19.160 225.0
#> Duster 360         0 3.1050 4.000 15.070 360.0
#> Merc 240D          0 2.5950 4.000 22.200 146.7
#> Merc 230           0 2.5750 4.000 22.850 140.8
#> Merc 280           0 3.6800 4.000 18.750 167.6
#> Merc 280C          0 3.6800 4.000 18.350 167.6
#> Merc 450SE         0 3.0000 4.070 16.900 275.8
#> Merc 450SL         0 3.0000 3.730 17.450 275.8
#> Merc 450SLC        0 3.0000 3.780 16.600 275.8
#>  [ reached 'max' / getOption("max.print") -- omitted 18 rows ]
  # dapply preserves the data structure of any matrices / data frames passed
  # Some fast matrix row/column functions are also provided by the matrixStats package
# Similarly, BY performs grouped comptations
BY(mtcars, f2, quantile)
#>         mpg cyl   disp    hp  drat     wt  qsec vs  am gear carb
#> 4.0%   21.4   4  71.10  52.0 3.690 1.5130 16.70  0 0.0    3    1
#> 4.25%  22.8   4  78.85  65.5 3.810 1.8850 18.56  1 0.5    4    1
#> 4.50%  26.0   4 108.00  91.0 4.080 2.2000 18.90  1 1.0    4    2
#> 4.75%  30.4   4 120.65  96.0 4.165 2.6225 19.95  1 1.0    4    2
#> 4.100% 33.9   4 146.70 113.0 4.930 3.1900 22.90  1 1.0    5    2
#> 6.0%   17.8   6 145.00 105.0 2.760 2.6200 15.50  0 0.0    3    1
#>  [ reached 'max' / getOption("max.print") -- omitted 9 rows ]
BY(mtcars, f2, quantile, expand.wide = TRUE)
#>   mpg.0% mpg.25% mpg.50% mpg.75% mpg.100% cyl.0% cyl.25% cyl.50% cyl.75%
#> 4   21.4    22.8      26    30.4     33.9      4       4       4       4
#>   cyl.100% disp.0% disp.25% disp.50% disp.75% disp.100% hp.0% hp.25% hp.50%
#> 4        4    71.1    78.85      108   120.65     146.7    52   65.5     91
#>   hp.75% hp.100% drat.0% drat.25% drat.50% drat.75% drat.100% wt.0% wt.25%
#> 4     96     113    3.69     3.81     4.08    4.165      4.93 1.513  1.885
#>   wt.50% wt.75% wt.100% qsec.0% qsec.25% qsec.50% qsec.75% qsec.100% vs.0%
#> 4    2.2 2.6225    3.19    16.7    18.56     18.9    19.95      22.9     0
#>   vs.25% vs.50% vs.75% vs.100% am.0% am.25% am.50% am.75% am.100% gear.0%
#> 4      1      1      1       1     0    0.5      1      1       1       3
#>   gear.25% gear.50% gear.75% gear.100% carb.0% carb.25% carb.50% carb.75%
#> 4        4        4        4         5       1        1        2        2
#>   carb.100%
#> 4         2
#>  [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
# For efficient (grouped) replacing and sweeping out computed statistics, use TRA()
sds <- fsd(mtcars)
head(TRA(mtcars, sds, "/"))     # Simple scaling (if sd's not needed, use fsd(mtcars, TRA = "/"))
#>                        mpg     cyl      disp       hp     drat       wt
#> Mazda RX4         3.484351 3.35961 1.2909608 1.604367 7.294100 2.677684
#> Mazda RX4 Wag     3.484351 3.35961 1.2909608 1.604367 7.294100 2.938298
#> Datsun 710        3.783009 2.23974 0.8713986 1.356419 7.200586 2.371079
#> Hornet 4 Drive    3.550719 3.35961 2.0816744 1.604367 5.760468 3.285784
#> Hornet Sportabout 3.102731 4.47948 2.9046619 2.552402 5.891388 3.515738
#> Valiant           3.003178 3.35961 1.8154137 1.531441 5.161978 3.536178
#>                        qsec       vs       am     gear      carb
#> Mazda RX4          9.211261 0.000000 2.004044 5.421494 2.4764735
#> Mazda RX4 Wag      9.524645 0.000000 2.004044 5.421494 2.4764735
#> Datsun 710        10.414433 1.984063 2.004044 5.421494 0.6191184
#> Hornet 4 Drive    10.878913 1.984063 0.000000 4.066120 0.6191184
#> Hornet Sportabout  9.524645 0.000000 0.000000 4.066120 1.2382368
#> Valiant           11.315413 1.984063 0.000000 4.066120 0.6191184

microbenchmark(TRA(mtcars, sds, "/"), sweep(mtcars, 2, sds, "/")) # A remarkable performance gain..
#> Unit: microseconds
#>                        expr     min       lq      mean   median      uq
#>       TRA(mtcars, sds, "/")   2.214   2.8085   4.87326   3.9155   5.863
#>  sweep(mtcars, 2, sds, "/") 320.210 332.9405 398.70204 357.8275 440.463
#>       max neval
#>    17.712   100
#>  1230.369   100

sds <- fsd(mtcars, f2)
head(TRA(mtcars, sds, "/", f2)) # Groupd scaling (if sd's not needed: fsd(mtcars, f2, TRA = "/"))
#>                         mpg cyl     disp       hp      drat       wt      qsec
#> Mazda RX4         14.447218 Inf 3.849628 4.534121  8.192327 7.352414  9.643407
#> Mazda RX4 Wag     14.447218 Inf 3.849628 4.534121  8.192327 8.068012  9.971493
#> Datsun 710         5.055626 Inf 4.019114 4.442421 10.534350 4.073293 11.061282
#> Hornet 4 Drive    14.722403 Inf 6.207525 4.534121  6.469838 9.022142 11.389297
#> Hornet Sportabout  7.304550 Inf 5.311981 3.432928  8.459515 4.529864 14.230606
#> Valiant           12.452126 Inf 5.413539 4.328025  5.797647 9.709677 11.846275
#>                         vs       am     gear      carb
#> Mazda RX4         0.000000 1.870829 5.796551 2.2067091
#> Mazda RX4 Wag     0.000000 1.870829 5.796551 2.2067091
#> Datsun 710        3.316625 2.140872 7.416198 1.9148542
#> Hornet 4 Drive    1.870829 0.000000 4.347413 0.5516773
#> Hornet Sportabout      NaN 0.000000 4.130678 1.2848321
#> Valiant           1.870829 0.000000 4.347413 0.5516773

# All functions above perserve the structure of matrices / data frames
# If conversions are required, use these efficient functions:
mtcarsM <- qM(mtcars)                      # Matrix from data.frame
head(qDF(mtcarsM))                         # data.frame from matrix columns
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
head(mrtl(mtcarsM, TRUE, "data.frame"))    # data.frame from matrix rows, etc..
#>     Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant
#> mpg        21            21       22.8           21.4              18.7    18.1
#> cyl         6             6        4.0            6.0               8.0     6.0
#>     Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL
#> mpg       14.3      24.4     22.8     19.2      17.8       16.4       17.3
#> cyl        8.0       4.0      4.0      6.0       6.0        8.0        8.0
#>     Merc 450SLC Cadillac Fleetwood Lincoln Continental Chrysler Imperial
#> mpg        15.2               10.4                10.4              14.7
#> cyl         8.0                8.0                 8.0               8.0
#>     Fiat 128 Honda Civic Toyota Corolla Toyota Corona Dodge Challenger
#> mpg     32.4        30.4           33.9          21.5             15.5
#> cyl      4.0         4.0            4.0           4.0              8.0
#>     AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2
#> mpg        15.2       13.3             19.2      27.3            26
#> cyl         8.0        8.0              8.0       4.0             4
#>     Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> mpg         30.4           15.8         19.7            15       21.4
#> cyl          4.0            8.0          6.0             8        4.0
#>  [ reached 'max' / getOption("max.print") -- omitted 4 rows ]
head(qDT(mtcarsM, "cars"))                 # Saving row.names when converting matrix to data.table
#>                 cars   mpg   cyl  disp    hp  drat    wt  qsec    vs    am
#>               <char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:         Mazda RX4  21.0     6   160   110  3.90 2.620 16.46     0     1
#> 2:     Mazda RX4 Wag  21.0     6   160   110  3.90 2.875 17.02     0     1
#> 3:        Datsun 710  22.8     4   108    93  3.85 2.320 18.61     1     1
#> 4:    Hornet 4 Drive  21.4     6   258   110  3.08 3.215 19.44     1     0
#>     gear  carb
#>    <num> <num>
#> 1:     4     4
#> 2:     4     4
#> 3:     4     1
#> 4:     3     1
#>  [ reached getOption("max.print") -- omitted 2 rows ]
head(qDT(mtcars, "cars"))                  # Same use a data.frame
#>                 cars   mpg   cyl  disp    hp  drat    wt  qsec    vs    am
#>               <char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:         Mazda RX4  21.0     6   160   110  3.90 2.620 16.46     0     1
#> 2:     Mazda RX4 Wag  21.0     6   160   110  3.90 2.875 17.02     0     1
#> 3:        Datsun 710  22.8     4   108    93  3.85 2.320 18.61     1     1
#> 4:    Hornet 4 Drive  21.4     6   258   110  3.08 3.215 19.44     1     0
#>     gear  carb
#>    <num> <num>
#> 1:     4     4
#> 2:     4     4
#> 3:     4     1
#> 4:     3     1
#>  [ reached getOption("max.print") -- omitted 2 rows ]
 
## Now let's get some real data and see how we can use this power for data manipulation
head(wlddev) # World Bank World Development Data: 216 countries, 61 years, 5 series (columns 9-13)
#>       country iso3c       date year decade     region     income  OECD PCGDP
#> 1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA
#> 2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA
#> 3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA
#> 4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA
#> 5 Afghanistan   AFG 1965-01-01 1964   1960 South Asia Low income FALSE    NA
#>   LIFEEX GINI       ODA     POP
#> 1 32.446   NA 116769997 8996973
#> 2 32.962   NA 232080002 9169410
#> 3 33.471   NA 112839996 9351441
#> 4 33.971   NA 237720001 9543205
#> 5 34.463   NA 295920013 9744781
#>  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

# Starting with some discriptive tools...
namlab(wlddev, class = TRUE)           # Show variable names, labels and classes
#>    Variable     Class
#> 1   country character
#> 2     iso3c    factor
#> 3      date      Date
#> 4      year   integer
#> 5    decade   integer
#> 6    region    factor
#> 7    income    factor
#> 8      OECD   logical
#> 9     PCGDP   numeric
#> 10   LIFEEX   numeric
#> 11     GINI   numeric
#> 12      ODA   numeric
#> 13      POP   numeric
#>                                                                                Label
#> 1                                                                       Country Name
#> 2                                                                       Country Code
#> 3                                                         Date Recorded (Fictitious)
#> 4                                                                               Year
#> 5                                                                             Decade
#> 6                                                                             Region
#> 7                                                                       Income Level
#> 8                                                            Is OECD Member Country?
#> 9                                                 GDP per capita (constant 2010 US$)
#> 10                                           Life expectancy at birth, total (years)
#> 11                                                  Gini index (World Bank estimate)
#> 12 Net official development assistance and official aid received (constant 2018 US$)
#> 13                                                                 Population, total
fnobs(wlddev)                          # Observation count
#> country   iso3c    date    year  decade  region  income    OECD   PCGDP  LIFEEX 
#>   13176   13176   13176   13176   13176   13176   13176   13176    9470   11670 
#>    GINI     ODA     POP 
#>    1744    8608   12919 
pwnobs(wlddev)                         # Pairwise observation count
#>         country iso3c  date  year decade region income  OECD PCGDP LIFEEX GINI
#> country   13176 13176 13176 13176  13176  13176  13176 13176  9470  11670 1744
#> iso3c     13176 13176 13176 13176  13176  13176  13176 13176  9470  11670 1744
#> date      13176 13176 13176 13176  13176  13176  13176 13176  9470  11670 1744
#> year      13176 13176 13176 13176  13176  13176  13176 13176  9470  11670 1744
#> decade    13176 13176 13176 13176  13176  13176  13176 13176  9470  11670 1744
#>          ODA   POP
#> country 8608 12919
#> iso3c   8608 12919
#> date    8608 12919
#> year    8608 12919
#> decade  8608 12919
#>  [ reached getOption("max.print") -- omitted 8 rows ]
head(fnobs(wlddev, wlddev$country))    # Grouped observation count
#>                country iso3c date year decade region income OECD PCGDP LIFEEX
#> Afghanistan         61    61   61   61     61     61     61   61    18     60
#> Albania             61    61   61   61     61     61     61   61    40     60
#> Algeria             61    61   61   61     61     61     61   61    60     60
#> American Samoa      61    61   61   61     61     61     61   61    17      0
#> Andorra             61    61   61   61     61     61     61   61    50      0
#>                GINI ODA POP
#> Afghanistan       0  60  60
#> Albania           9  32  60
#> Algeria           3  60  60
#> American Samoa    0   0  60
#> Andorra           0   0  60
#>  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
fndistinct(wlddev)                     # Distinct values
#> country   iso3c    date    year  decade  region  income    OECD   PCGDP  LIFEEX 
#>     216     216      61      61       7       7       4       2    9470   10548 
#>    GINI     ODA     POP 
#>     368    7832   12877 
descr(wlddev)                          # Describe data
#> Dataset: wlddev, 13 Variables, N = 13176
#> --------------------------------------------------------------------------------
#> country (character): Country Name
#> Statistics
#>       N  Ndist
#>   13176    216
#> Table
#>                       Freq   Perc
#> Afghanistan             61   0.46
#> Albania                 61   0.46
#> Algeria                 61   0.46
#> American Samoa          61   0.46
#> Andorra                 61   0.46
#> Angola                  61   0.46
#> Antigua and Barbuda     61   0.46
#> Argentina               61   0.46
#> Armenia                 61   0.46
#> Aruba                   61   0.46
#> Australia               61   0.46
#> Austria                 61   0.46
#> Azerbaijan              61   0.46
#> Bahamas, The            61   0.46
#> ... 202 Others       12322  93.52
#> 
#> Summary of Table Frequencies
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>      61      61      61      61      61      61 
#> --------------------------------------------------------------------------------
#> iso3c (factor): Country Code
#> Statistics
#>       N  Ndist
#>   13176    216
#> Table
#>                  Freq   Perc
#> ABW                61   0.46
#> AFG                61   0.46
#> AGO                61   0.46
#> ALB                61   0.46
#> AND                61   0.46
#> ARE                61   0.46
#> ARG                61   0.46
#> ARM                61   0.46
#> ASM                61   0.46
#> ATG                61   0.46
#> AUS                61   0.46
#> AUT                61   0.46
#> AZE                61   0.46
#> BDI                61   0.46
#> ... 202 Others  12322  93.52
#> 
#> Summary of Table Frequencies
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>      61      61      61      61      61      61 
#> --------------------------------------------------------------------------------
#> date (Date): Date Recorded (Fictitious)
#> Statistics
#>          N       Ndist         Min         Max  
#>      13176          61  1961-01-01  2021-01-01  
#> --------------------------------------------------------------------------------
#> year (integer): Year
#> Statistics
#>       N  Ndist  Mean     SD   Min   Max  Skew  Kurt
#>   13176     61  1990  17.61  1960  2020    -0   1.8
#> Quantiles
#>     1%    5%   10%   25%   50%   75%   90%   95%   99%
#>   1960  1963  1966  1975  1990  2005  2014  2017  2020
#> --------------------------------------------------------------------------------
#> decade (integer): Decade
#> Statistics
#>       N  Ndist     Mean     SD   Min   Max  Skew  Kurt
#>   13176      7  1985.57  17.51  1960  2020  0.03  1.79
#> Quantiles
#>     1%    5%   10%   25%   50%   75%   90%   95%   99%
#>   1960  1960  1960  1970  1990  2000  2010  2010  2020
#> --------------------------------------------------------------------------------
#> region (factor): Region
#> Statistics
#>       N  Ndist
#>   13176      7
#> Table
#>                             Freq   Perc
#> Europe & Central Asia       3538  26.85
#> Sub-Saharan Africa          2928  22.22
#> Latin America & Caribbean   2562  19.44
#> East Asia & Pacific         2196  16.67
#> Middle East & North Africa  1281   9.72
#> South Asia                   488   3.70
#> North America                183   1.39
#> --------------------------------------------------------------------------------
#> income (factor): Income Level
#> Statistics
#>       N  Ndist
#>   13176      4
#> Table
#>                      Freq   Perc
#> High income          4819  36.57
#> Upper middle income  3660  27.78
#> Lower middle income  2867  21.76
#> Low income           1830  13.89
#> --------------------------------------------------------------------------------
#> OECD (logical): Is OECD Member Country?
#> Statistics
#>       N  Ndist
#>   13176      2
#> Table
#>         Freq   Perc
#> FALSE  10980  83.33
#> TRUE    2196  16.67
#> --------------------------------------------------------------------------------
#> PCGDP (numeric): GDP per capita (constant 2010 US$)
#> Statistics (28.13% NAs)
#>      N  Ndist      Mean        SD     Min        Max  Skew   Kurt
#>   9470   9470  12048.78  19077.64  132.08  196061.42  3.13  17.12
#> Quantiles
#>       1%      5%     10%      25%      50%       75%       90%       95%
#>   227.71  399.62  555.55  1303.19  3767.16  14787.03  35646.02  48507.84
#>        99%
#>   92340.28
#> --------------------------------------------------------------------------------
#> LIFEEX (numeric): Life expectancy at birth, total (years)
#> Statistics (11.43% NAs)
#>       N  Ndist  Mean     SD    Min    Max   Skew  Kurt
#>   11670  10548  64.3  11.48  18.91  85.42  -0.67  2.67
#> Quantiles
#>      1%     5%    10%    25%    50%    75%    90%    95%    99%
#>   35.83  42.77  46.83  56.36  67.44  72.95  77.08  79.34  82.36
#> --------------------------------------------------------------------------------
#> GINI (numeric): Gini index (World Bank estimate)
#> Statistics (86.76% NAs)
#>      N  Ndist   Mean   SD   Min   Max  Skew  Kurt
#>   1744    368  38.53  9.2  20.7  65.8   0.6  2.53
#> Quantiles
#>     1%    5%   10%   25%   50%  75%   90%    95%   99%
#>   24.6  26.3  27.6  31.5  36.4   45  52.6  55.98  60.5
#> --------------------------------------------------------------------------------
#> ODA (numeric): Net official development assistance and official aid received (constant 2018 US$)
#> Statistics (34.67% NAs)
#>      N  Ndist        Mean          SD          Min             Max  Skew
#>   8608   7832  454'720131  868'712654  -997'679993  2.56715605e+10  6.98
#>     Kurt
#>   114.89
#> Quantiles
#>             1%           5%          10%          25%         50%         75%
#>   -12'593999.7  1'363500.01  8'347000.31  44'887499.8  165'970001  495'042503
#>              90%             95%             99%
#>   1.18400697e+09  1.93281696e+09  3.73380782e+09
#> --------------------------------------------------------------------------------
#> POP (numeric): Population, total
#> Statistics (1.95% NAs)
#>       N  Ndist         Mean          SD   Min             Max  Skew    Kurt
#>   12919  12877  24'245971.6  102'120674  2833  1.39771500e+09  9.75  108.91
#> Quantiles
#>        1%       5%      10%     25%       50%        75%          90%
#>   8698.84  31083.3  62268.4  443791  4'072517  12'816178  46'637331.4
#>           95%         99%
#>   81'177252.5  308'862641
#> --------------------------------------------------------------------------------
varying(wlddev, ~ country)             # Show which variables vary within countries
#>  iso3c   date   year decade region income   OECD  PCGDP LIFEEX   GINI    ODA 
#>  FALSE   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE   TRUE   TRUE   TRUE   TRUE 
#>    POP 
#>   TRUE 
qsu(wlddev, pid = ~ country,           # Panel-summarize columns 9 though 12 of this data
    cols = 9:12, vlabels = TRUE)       # (between and within countries)
#> , , PCGDP: GDP per capita (constant 2010 US$)
#> 
#>              N/T        Mean          SD          Min         Max
#> Overall     9470   12048.778  19077.6416     132.0776  196061.417
#> Between      206  12962.6054  20189.9007     253.1886   141200.38
#> Within   45.9709   12048.778   6723.6808  -33504.8721  76767.5254
#> 
#> , , LIFEEX: Life expectancy at birth, total (years)
#> 
#>              N/T     Mean       SD      Min      Max
#> Overall    11670  64.2963  11.4764   18.907  85.4171
#> Between      207  64.9537   9.8936  40.9663  85.4171
#> Within   56.3768  64.2963   6.0842  32.9068  84.4198
#> 
#> , , GINI: Gini index (World Bank estimate)
#> 
#>              N/T     Mean      SD      Min      Max
#> Overall     1744  38.5341  9.2006     20.7     65.8
#> Between      167  39.4233  8.1356  24.8667  61.7143
#> Within   10.4431  38.5341  2.9277  25.3917  55.3591
#> 
#> , , ODA: Net official development assistance and official aid received (constant 2018 US$)
#> 
#>              N/T        Mean          SD              Min             Max
#> Overall     8608  454'720131  868'712654      -997'679993  2.56715605e+10
#> Between      178  439'168412  569'049959       468717.916  3.62337432e+09
#> Within   48.3596  454'720131  650'709624  -2.44379420e+09  2.45610972e+10
#> 
qsu(wlddev, ~ region, ~ country,       # Do all of that by region and also compute higher moments
    cols = 9:12, higher = TRUE)        # -> returns a 4D array
#> , , Overall, PCGDP
#> 
#>                              N/T        Mean          SD         Min
#> East Asia & Pacific         1467  10513.2441  14383.5507    132.0776
#> Europe & Central Asia       2243  25992.9618  26435.1316    366.9354
#> Latin America & Caribbean   1976   7628.4477   8818.5055   1005.4085
#> Middle East & North Africa   842  13878.4213  18419.7912    578.5996
#> North America                180    48699.76  24196.2855  16405.9053
#> South Asia                   382   1235.9256   1611.2232    265.9625
#> Sub-Saharan Africa          2380   1840.0259   2596.0104    164.3366
#>                                    Max    Skew     Kurt
#> East Asia & Pacific         71992.1517  1.6392   4.7419
#> Europe & Central Asia       196061.417  2.2022  10.1977
#> Latin America & Caribbean   88391.3331  4.1702  29.3739
#> Middle East & North Africa  116232.753  2.4178   9.7669
#> North America               113236.091   0.938   2.9688
#> South Asia                    8476.564  2.7874  10.3402
#> Sub-Saharan Africa          20532.9523  3.1161  14.4175
#> 
#> , , Between, PCGDP
#> 
#>                             N/T        Mean          SD         Min         Max
#> East Asia & Pacific          34  10513.2441   12771.742    444.2899  39722.0077
#> Europe & Central Asia        56  25992.9618   24051.035    809.4753   141200.38
#> Latin America & Caribbean    38   7628.4477   8470.9708   1357.3326  77403.7443
#>                               Skew     Kurt
#> East Asia & Pacific         1.1488   2.7089
#> Europe & Central Asia       2.0026   9.0733
#> Latin America & Caribbean   4.4548  32.4956
#> 
#>  [ reached getOption("max.print") -- omitted 4 row(s) and 10 matrix slice(s) ]
qsu(wlddev, ~ region, ~ country, cols = 9:12,
    higher = TRUE, array = FALSE) |>                           # Return as a list of matrices..
unlist2d(c("Variable","Trans"), row.names = "Region") |> head()# and turn into a tidy data.frame
#>   Variable   Trans                     Region    N      Mean        SD
#> 1    PCGDP Overall        East Asia & Pacific 1467 10513.244 14383.551
#> 2    PCGDP Overall      Europe & Central Asia 2243 25992.962 26435.132
#> 3    PCGDP Overall  Latin America & Caribbean 1976  7628.448  8818.505
#> 4    PCGDP Overall Middle East & North Africa  842 13878.421 18419.791
#> 5    PCGDP Overall              North America  180 48699.760 24196.285
#> 6    PCGDP Overall                 South Asia  382  1235.926  1611.223
#>          Min        Max      Skew      Kurt
#> 1   132.0776  71992.152 1.6392248  4.741856
#> 2   366.9354 196061.417 2.2022472 10.197685
#> 3  1005.4085  88391.333 4.1701769 29.373869
#> 4   578.5996 116232.753 2.4177586  9.766883
#> 5 16405.9053 113236.091 0.9380056  2.968769
#> 6   265.9625   8476.564 2.7873830 10.340176
pwcor(num_vars(wlddev), P = TRUE)                           # Pairwise correlations with p-value
#>          year decade  PCGDP LIFEEX   GINI    ODA    POP
#> year      1     .99*   .16*   .47*  -.20*   .14*   .06*
#> decade   .99*    1     .15*   .46*  -.20*   .14*   .06*
#> PCGDP    .16*   .15*    1     .57*  -.44*  -.16*  -.06*
#> LIFEEX   .47*   .46*   .57*    1    -.35*  -.02    .03*
#> GINI    -.20*  -.20*  -.44*  -.35*    1    -.20*   .04 
#> ODA      .14*   .14*  -.16*  -.02   -.20*    1     .31*
#> POP      .06*   .06*  -.06*   .03*   .04    .31*    1  
pwcor(fmean(num_vars(wlddev), wlddev$country), P = TRUE)    # Correlating country means
#> Warning: the standard deviation is zero
#>           year  decade   PCGDP  LIFEEX    GINI     ODA     POP
#> year       NA      NA      NA      NA      NA      NA      NA 
#> decade     NA      1      .00     .00     .00     .00     .00 
#> PCGDP      NA     .00      1      .60*   -.42*   -.25*   -.07 
#> LIFEEX     NA     .00     .60*     1     -.40*   -.21*   -.02 
#> GINI       NA     .00    -.42*   -.40*     1     -.19*   -.04 
#> ODA        NA     .00    -.25*   -.21*   -.19*     1      .50*
#> POP        NA     .00    -.07    -.02    -.04     .50*     1  
pwcor(fwithin(num_vars(wlddev), wlddev$country), P = TRUE)  # Within-country correlations
#>          year decade  PCGDP LIFEEX   GINI    ODA    POP
#> year      1     .99*   .44*   .84*  -.21*   .19*   .24*
#> decade   .99*    1     .44*   .83*  -.19*   .18*   .24*
#> PCGDP    .44*   .44*    1     .31*  -.01   -.01    .06*
#> LIFEEX   .84*   .83*   .31*    1    -.16*   .17*   .29*
#> GINI    -.21*  -.19*  -.01   -.16*    1    -.08*   .01 
#> ODA      .19*   .18*  -.01    .17*  -.08*    1    -.11*
#> POP      .24*   .24*   .06*   .29*   .01   -.11*    1  
psacf(wlddev, ~country, ~year, cols = 9:12)                 # Panel-data Autocorrelation function

pspacf(wlddev, ~country, ~year, cols = 9:12)                # Partial panel-autocorrelations

psmat(wlddev, ~iso3c, ~year, cols = 9:12) |> plot()         # Convert panel to 3D array and plot


## collapse offers a few very efficent functions for data manipulation:
# Fast selecting and replacing columns
series <- get_vars(wlddev, 9:12)     # Same as wlddev[9:12] but 2x faster
series <- fselect(wlddev, PCGDP:ODA) # Same thing: > 100x faster than dplyr::select
get_vars(wlddev, 9:12) <- series     # Replace, 8x faster wlddev[9:12] <- series + replaces names
fselect(wlddev, PCGDP:ODA) <- series # Same thing

# Fast subsetting
head(fsubset(wlddev, country == "Ireland", -country, -iso3c))
#>         date year decade                region      income OECD PCGDP   LIFEEX
#> 1 1961-01-01 1960   1960 Europe & Central Asia High income TRUE    NA 69.79651
#> 2 1962-01-01 1961   1960 Europe & Central Asia High income TRUE    NA 69.97827
#> 3 1963-01-01 1962   1960 Europe & Central Asia High income TRUE    NA 70.13407
#> 4 1964-01-01 1963   1960 Europe & Central Asia High income TRUE    NA 70.27293
#> 5 1965-01-01 1964   1960 Europe & Central Asia High income TRUE    NA 70.40129
#> 6 1966-01-01 1965   1960 Europe & Central Asia High income TRUE    NA 70.52315
#>   GINI ODA     POP
#> 1   NA  NA 2828600
#> 2   NA  NA 2824400
#> 3   NA  NA 2836050
#> 4   NA  NA 2852650
#> 5   NA  NA 2866550
#> 6   NA  NA 2877300
head(fsubset(wlddev, country == "Ireland" & year > 1990, year, PCGDP:ODA))
#>   year    PCGDP   LIFEEX GINI ODA
#> 1 1991 24642.11 75.00527   NA  NA
#> 2 1992 25292.81 75.18095   NA  NA
#> 3 1993 25844.34 75.33612   NA  NA
#> 4 1994 27224.37 75.47680 36.9  NA
#> 5 1995 29694.65 75.61756 37.0  NA
#> 6 1996 31644.89 75.83171 35.6  NA
ss(wlddev, 1:10, 1:10) # This is an order of magnitude faster than wlddev[1:10, 1:10]
#>       country iso3c       date year decade     region     income  OECD PCGDP
#> 1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA
#> 2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA
#> 3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA
#> 4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA
#> 5 Afghanistan   AFG 1965-01-01 1964   1960 South Asia Low income FALSE    NA
#> 6 Afghanistan   AFG 1966-01-01 1965   1960 South Asia Low income FALSE    NA
#> 7 Afghanistan   AFG 1967-01-01 1966   1960 South Asia Low income FALSE    NA
#>   LIFEEX
#> 1 32.446
#> 2 32.962
#> 3 33.471
#> 4 33.971
#> 5 34.463
#> 6 34.948
#> 7 35.430
#>  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]

# Fast transforming
head(ftransform(wlddev, ODA_GDP = ODA / PCGDP, ODA_LIFEEX = sqrt(ODA) / LIFEEX))
#> Warning: NaNs produced
#>       country iso3c       date year decade     region     income  OECD PCGDP
#> 1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA
#> 2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA
#> 3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA
#> 4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA
#>   LIFEEX GINI       ODA     POP ODA_GDP ODA_LIFEEX
#> 1 32.446   NA 116769997 8996973      NA   333.0462
#> 2 32.962   NA 232080002 9169410      NA   462.1738
#> 3 33.471   NA 112839996 9351441      NA   317.3678
#> 4 33.971   NA 237720001 9543205      NA   453.8627
#>  [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
settransform(wlddev, ODA_GDP = ODA / PCGDP, ODA_LIFEEX = sqrt(ODA) / LIFEEX) # by reference
#> Warning: NaNs produced
head(ftransform(wlddev, PCGDP = NULL, ODA = NULL, GINI_sum = fsum(GINI)))
#>       country iso3c       date year decade     region     income  OECD LIFEEX
#> 1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE 32.446
#> 2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE 32.962
#> 3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE 33.471
#> 4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE 33.971
#> 5 Afghanistan   AFG 1965-01-01 1964   1960 South Asia Low income FALSE 34.463
#>   GINI     POP ODA_GDP ODA_LIFEEX GINI_sum
#> 1   NA 8996973      NA   333.0462  67203.5
#> 2   NA 9169410      NA   462.1738  67203.5
#> 3   NA 9351441      NA   317.3678  67203.5
#> 4   NA 9543205      NA   453.8627  67203.5
#> 5   NA 9744781      NA   499.1535  67203.5
#>  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
head(ftransformv(wlddev, 9:12, log))                   # Can also transform with lists of columns
#> Warning: NaNs produced
#>       country iso3c       date year decade     region     income  OECD PCGDP
#> 1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA
#> 2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA
#> 3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA
#> 4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA
#>     LIFEEX GINI      ODA     POP ODA_GDP ODA_LIFEEX
#> 1 3.479577   NA 18.57572 8996973      NA   333.0462
#> 2 3.495355   NA 19.26259 9169410      NA   462.1738
#> 3 3.510679   NA 18.54148 9351441      NA   317.3678
#> 4 3.525507   NA 19.28660 9543205      NA   453.8627
#>  [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
head(ftransformv(wlddev, 9:12, fscale, apply = FALSE)) # apply = FALSE invokes fscale.data.frame
#>       country iso3c       date year decade     region     income  OECD PCGDP
#> 1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA
#> 2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA
#> 3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA
#> 4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA
#>      LIFEEX GINI        ODA     POP ODA_GDP ODA_LIFEEX
#> 1 -2.775283   NA -0.3890241 8996973      NA   333.0462
#> 2 -2.730321   NA -0.2562874 9169410      NA   462.1738
#> 3 -2.685969   NA -0.3935480 9351441      NA   317.3678
#> 4 -2.642402   NA -0.2497951 9543205      NA   453.8627
#>  [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
settransformv(wlddev, 9:12, fscale, apply = FALSE)     # Changing the data by reference
ftransform(wlddev) <- fscale(gv(wlddev, 9:12))         # Same thing (using replacement method)

library(magrittr) # Same thing, using magrittr
wlddev %<>% ftransformv(9:12, fscale, apply = FALSE)
wlddev %>% ftransform(gv(., 9:12) |>              # With compound pipes: Scaling and lagging
                        fscale() |> flag(0:2, iso3c, year)) |> head()
#>       country iso3c       date year decade     region     income  OECD PCGDP
#> 1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA
#> 2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA
#> 3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA
#>      LIFEEX GINI        ODA     POP ODA_GDP ODA_LIFEEX L1.PCGDP L2.PCGDP
#> 1 -2.775283   NA -0.3890241 8996973      NA   333.0462       NA       NA
#> 2 -2.730321   NA -0.2562874 9169410      NA   462.1738       NA       NA
#> 3 -2.685969   NA -0.3935480 9351441      NA   317.3678       NA       NA
#>   L1.LIFEEX L2.LIFEEX L1.GINI L2.GINI     L1.ODA     L2.ODA
#> 1        NA        NA      NA      NA         NA         NA
#> 2 -2.775283        NA      NA      NA -0.3890241         NA
#> 3 -2.730321 -2.775283      NA      NA -0.2562874 -0.3890241
#>  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]

# Fast reordering
head(roworder(wlddev, -country, year))
#>    country iso3c       date year decade             region              income
#> 1 Zimbabwe   ZWE 1961-01-01 1960   1960 Sub-Saharan Africa Lower middle income
#> 2 Zimbabwe   ZWE 1962-01-01 1961   1960 Sub-Saharan Africa Lower middle income
#> 3 Zimbabwe   ZWE 1963-01-01 1962   1960 Sub-Saharan Africa Lower middle income
#> 4 Zimbabwe   ZWE 1964-01-01 1963   1960 Sub-Saharan Africa Lower middle income
#>    OECD      PCGDP     LIFEEX GINI        ODA     POP     ODA_GDP ODA_LIFEEX
#> 1 FALSE -0.5794259 -0.9826503   NA         NA 3776681          NA         NA
#> 2 FALSE -0.5779547 -0.9422195   NA -0.5233262 3905034    97.77415   5.912678
#> 3 FALSE -0.5789920 -0.9018759   NA         NA 4039201          NA         NA
#> 4 FALSE -0.5775741 -0.8620551   NA -0.4094221 4178726 96162.61027 182.938198
#>  [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
head(colorder(wlddev, country, year))
#>       country year iso3c       date decade     region     income  OECD PCGDP
#> 1 Afghanistan 1960   AFG 1961-01-01   1960 South Asia Low income FALSE    NA
#> 2 Afghanistan 1961   AFG 1962-01-01   1960 South Asia Low income FALSE    NA
#> 3 Afghanistan 1962   AFG 1963-01-01   1960 South Asia Low income FALSE    NA
#> 4 Afghanistan 1963   AFG 1964-01-01   1960 South Asia Low income FALSE    NA
#>      LIFEEX GINI        ODA     POP ODA_GDP ODA_LIFEEX
#> 1 -2.775283   NA -0.3890241 8996973      NA   333.0462
#> 2 -2.730321   NA -0.2562874 9169410      NA   462.1738
#> 3 -2.685969   NA -0.3935480 9351441      NA   317.3678
#> 4 -2.642402   NA -0.2497951 9543205      NA   453.8627
#>  [ reached 'max' / getOption("max.print") -- omitted 2 rows ]

# Fast renaming
head(frename(wlddev, country = Ctry, year = Yr))
#>          Ctry iso3c       date   Yr decade     region     income  OECD PCGDP
#> 1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA
#> 2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA
#> 3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA
#> 4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA
#>      LIFEEX GINI        ODA     POP ODA_GDP ODA_LIFEEX
#> 1 -2.775283   NA -0.3890241 8996973      NA   333.0462
#> 2 -2.730321   NA -0.2562874 9169410      NA   462.1738
#> 3 -2.685969   NA -0.3935480 9351441      NA   317.3678
#> 4 -2.642402   NA -0.2497951 9543205      NA   453.8627
#>  [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
setrename(wlddev, country = Ctry, year = Yr)     # By reference
head(frename(wlddev, tolower, cols = 9:12))
#>          Ctry iso3c       date   Yr decade     region     income  OECD pcgdp
#> 1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA
#> 2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA
#> 3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA
#> 4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA
#>      lifeex gini        oda     POP ODA_GDP ODA_LIFEEX
#> 1 -2.775283   NA -0.3890241 8996973      NA   333.0462
#> 2 -2.730321   NA -0.2562874 9169410      NA   462.1738
#> 3 -2.685969   NA -0.3935480 9351441      NA   317.3678
#> 4 -2.642402   NA -0.2497951 9543205      NA   453.8627
#>  [ reached 'max' / getOption("max.print") -- omitted 2 rows ]

# Fast grouping
fgroup_by(wlddev, Ctry, decade) |> fgroup_vars() |> head()
#>          Ctry decade
#> 1 Afghanistan   1960
#> 2 Afghanistan   1960
#> 3 Afghanistan   1960
#> 4 Afghanistan   1960
#> 5 Afghanistan   1960
#> 6 Afghanistan   1960
rm(wlddev)                                       # .. but only works with collapse functions

## Now lets start putting things together
wlddev |> fsubset(year > 1990, region, income, PCGDP:ODA) |>
  fgroup_by(region, income) |> fmean()         # Fast aggregation using the mean
#>                       region              income      PCGDP   LIFEEX     GINI
#> 1        East Asia & Pacific         High income 32671.0522 78.21996 32.95000
#> 2        East Asia & Pacific Lower middle income  1738.5111 65.45647 36.51972
#> 3        East Asia & Pacific Upper middle income  4575.8695 70.87431 40.64815
#> 4      Europe & Central Asia         High income 40814.1215 77.67583 30.94218
#> 5      Europe & Central Asia          Low income   695.7951 65.04128 32.13333
#> 6      Europe & Central Asia Lower middle income  1779.7149 68.79873 30.66176
#> 7      Europe & Central Asia Upper middle income  5182.1800 71.30199 34.85362
#> 8  Latin America & Caribbean         High income 19864.5057 75.55674 48.35714
#> 9  Latin America & Caribbean          Low income  1189.8492 58.97359 41.10000
#> 10 Latin America & Caribbean Lower middle income  1987.5292 69.31612 50.58125
#> 11 Latin America & Caribbean Upper middle income  6224.6674 72.58127 49.88547
#>          ODA
#> 1  112194118
#> 2  509965862
#> 3  219146704
#> 4  321907195
#> 5  244539286
#> 6  395371417
#> 7  463652924
#> 8   31307379
#> 9  753747928
#> 10 559540257
#> 11 188305879
#>  [ reached 'max' / getOption("max.print") -- omitted 12 rows ]

# Same thing using dplyr manipulation verbs
library(dplyr)
wlddev |> filter(year > 1990) |> select(region, income, PCGDP:ODA) |>
  group_by(region,income) |> fmean()       # This is already a lot faster than summarize_all(mean)
#> # A tibble: 23 × 6
#>    region                    income               PCGDP LIFEEX  GINI        ODA
#>    <fct>                     <fct>                <dbl>  <dbl> <dbl>      <dbl>
#>  1 East Asia & Pacific       High income         32671.   78.2  33.0 112194118.
#>  2 East Asia & Pacific       Lower middle income  1739.   65.5  36.5 509965862.
#>  3 East Asia & Pacific       Upper middle income  4576.   70.9  40.6 219146704.
#>  4 Europe & Central Asia     High income         40814.   77.7  30.9 321907195.
#>  5 Europe & Central Asia     Low income            696.   65.0  32.1 244539286.
#>  6 Europe & Central Asia     Lower middle income  1780.   68.8  30.7 395371417.
#>  7 Europe & Central Asia     Upper middle income  5182.   71.3  34.9 463652924.
#>  8 Latin America & Caribbean High income         19865.   75.6  48.4  31307379.
#>  9 Latin America & Caribbean Low income           1190.   59.0  41.1 753747928.
#> 10 Latin America & Caribbean Lower middle income  1988.   69.3  50.6 559540257.
#> # ℹ 13 more rows

wlddev |> fsubset(year > 1990, region, income, PCGDP:POP) |>
  fgroup_by(region, income) |> fmean(POP)     # Weighted group means
#>                       region              income     sum.POP      PCGDP
#> 1        East Asia & Pacific         High income  6165902760 37889.3406
#> 2        East Asia & Pacific Lower middle income 13784998066  2135.6182
#> 3        East Asia & Pacific Upper middle income 40150644873  3769.4215
#> 4      Europe & Central Asia         High income 13923291507 34583.4203
#> 5      Europe & Central Asia          Low income   203224216   722.7351
#> 6      Europe & Central Asia Lower middle income  2399154808  2145.1352
#> 7      Europe & Central Asia Upper middle income  8919358306  8238.8391
#> 8  Latin America & Caribbean         High income   842939686 13068.1667
#> 9  Latin America & Caribbean          Low income   267125746  1193.5889
#> 10 Latin America & Caribbean Lower middle income   815192217  2011.6260
#>      LIFEEX     GINI        ODA
#> 1  80.79250 32.81601  -79785907
#> 2  68.24548 36.40362 1060544119
#> 3  73.07773 40.38496 1229983586
#> 4  78.82923 32.27710 1107199019
#> 5  65.76604 32.22326  261043896
#> 6  68.98050 28.97857  556232624
#> 7  69.78322 38.66475 1187976647
#> 8  76.76809 48.85497   97880105
#> 9  59.34831 41.10000  794781510
#> 10 69.33538 50.52363  571015463
#>  [ reached 'max' / getOption("max.print") -- omitted 13 rows ]

wlddev |> fsubset(year > 1990, region, income, PCGDP:POP) |>
  fgroup_by(region, income) |> fsd(POP)       # Weighted group standard deviations
#>                       region              income     sum.POP       PCGDP
#> 1        East Asia & Pacific         High income  6165902760 11619.90339
#> 2        East Asia & Pacific Lower middle income 13784998066  1074.84975
#> 3        East Asia & Pacific Upper middle income 40150644873  2374.39410
#> 4      Europe & Central Asia         High income 13923291507 13593.46879
#> 5      Europe & Central Asia          Low income   203224216   238.47730
#> 6      Europe & Central Asia Lower middle income  2399154808   841.97662
#> 7      Europe & Central Asia Upper middle income  8919358306  3175.38606
#> 8  Latin America & Caribbean         High income   842939686  6273.64310
#> 9  Latin America & Caribbean          Low income   267125746    70.56928
#> 10 Latin America & Caribbean Lower middle income   815192217   580.31064
#>      LIFEEX     GINI        ODA
#> 1  2.730868 1.247116  129087430
#> 2  4.230245 4.316708  943798728
#> 3  2.428883 2.458222 1519636778
#> 4  2.895801 2.994940 1131585991
#> 5  4.555972 1.547793  112926454
#> 6  1.688117 4.573107  370416116
#> 7  3.707457 4.227846 1068502287
#> 8  2.421882 5.289102   71662290
#> 9  2.815436 0.000000  566758933
#> 10 4.220554 5.684628  243414533
#>  [ reached 'max' / getOption("max.print") -- omitted 13 rows ]

wlddev |> na_omit(cols = "POP") |> fgroup_by(region, income) |>
  fselect(PCGDP:POP) |> fnth(0.75, POP)       # Weighted group third quartile
#>                       region              income     sum.POP     PCGDP   LIFEEX
#> 1        East Asia & Pacific         High income 11407808149 42237.359 81.72529
#> 2        East Asia & Pacific Lower middle income 22174820629  2359.522 69.86711
#> 3        East Asia & Pacific Upper middle income 69639871478  3815.390 73.83074
#> 4      Europe & Central Asia         High income 27285316560 37968.308 79.13238
#> 5      Europe & Central Asia          Low income   311485944  1028.229 68.92227
#> 6      Europe & Central Asia Lower middle income  4511786205  3109.060 70.16087
#> 7      Europe & Central Asia Upper middle income 16972478305 10579.353 70.58594
#> 8  Latin America & Caribbean         High income  1466292826 13936.359 77.63032
#> 9  Latin America & Caribbean          Low income   429756890  1429.081 60.35597
#> 10 Latin America & Caribbean Lower middle income  1290800630  2248.974 71.20178
#>        GINI        ODA
#> 1  33.50000  755719050
#> 2  39.33432 1840884811
#> 3  42.40186 2582104924
#> 4  34.70000 1751293143
#> 5  33.59662  370935975
#> 6  29.83209  724985435
#> 7  41.20000 1799582651
#> 8  54.89047  158773677
#> 9  41.10000  968787392
#> 10 55.50000  660837650
#>  [ reached 'max' / getOption("max.print") -- omitted 13 rows ]

wlddev |> fgroup_by(country) |> fselect(PCGDP:ODA) |>
  fwithin() |> head()                         # Within transformation
#>   PCGDP    LIFEEX GINI         ODA
#> 1    NA -16.75117   NA -1370778502
#> 2    NA -16.23517   NA -1255468497
#> 3    NA -15.72617   NA -1374708502
#> 4    NA -15.22617   NA -1249828497
#> 5    NA -14.73417   NA -1191628485
#> 6    NA -14.24917   NA -1145708502
wlddev |> fgroup_by(country) |> fselect(PCGDP:ODA) |>
  fmedian(TRA = "-") |> head()                # Grouped centering using the median
#>   PCGDP   LIFEEX GINI        ODA
#> 1    NA -17.5395   NA -144765007
#> 2    NA -17.0235   NA  -29455002
#> 3    NA -16.5145   NA -148695007
#> 4    NA -16.0145   NA  -23815002
#> 5    NA -15.5225   NA   34385010
#> 6    NA -15.0375   NA   80304993
# Replacing data points by the weighted first quartile:
wlddev |> na_omit(cols = "POP") |> fgroup_by(country) |>
  fselect(country, year, PCGDP:POP) %>%
  ftransform(fselect(., -country, -year) |>
             fnth(0.25, POP, "fill")) |> head()
#>       country year    PCGDP   LIFEEX GINI       ODA     POP
#> 1 Afghanistan 1960 407.2632 45.91724   NA 237905880 8996973
#> 2 Afghanistan 1961 407.2632 45.91724   NA 237905880 9169410
#> 3 Afghanistan 1962 407.2632 45.91724   NA 237905880 9351441
#> 4 Afghanistan 1963 407.2632 45.91724   NA 237905880 9543205
#> 5 Afghanistan 1964 407.2632 45.91724   NA 237905880 9744781
#> 6 Afghanistan 1965 407.2632 45.91724   NA 237905880 9956320

wlddev |> fgroup_by(country) |> fselect(PCGDP:ODA) |> fscale() |> head() # Standardizing
#>   PCGDP    LIFEEX GINI        ODA
#> 1    NA -1.653181   NA -0.6498451
#> 2    NA -1.602256   NA -0.5951801
#> 3    NA -1.552023   NA -0.6517082
#> 4    NA -1.502678   NA -0.5925063
#> 5    NA -1.454122   NA -0.5649154
#> 6    NA -1.406257   NA -0.5431461
wlddev |> fgroup_by(country) |> fselect(PCGDP:POP) |>
   fscale(POP) |> head()  # Weighted..
#>       POP PCGDP    LIFEEX GINI        ODA
#> 1 8996973    NA -2.172769   NA -0.9502811
#> 2 9169410    NA -2.119489   NA -0.9011481
#> 3 9351441    NA -2.066932   NA -0.9519557
#> 4 9543205    NA -2.015304   NA -0.8987449
#> 5 9744781    NA -1.964502   NA -0.8739462
#> 6 9956320    NA -1.914423   NA -0.8543799

wlddev |> fselect(country, year, PCGDP:ODA) |>  # Adding 1 lead and 2 lags of each variable
  fgroup_by(country) |> flag(-1:2, year) |> head()
#>       country year F1.PCGDP PCGDP L1.PCGDP L2.PCGDP F1.LIFEEX LIFEEX L1.LIFEEX
#> 1 Afghanistan 1960       NA    NA       NA       NA    32.962 32.446        NA
#> 2 Afghanistan 1961       NA    NA       NA       NA    33.471 32.962    32.446
#> 3 Afghanistan 1962       NA    NA       NA       NA    33.971 33.471    32.962
#>   L2.LIFEEX F1.GINI GINI L1.GINI L2.GINI    F1.ODA       ODA    L1.ODA
#> 1        NA      NA   NA      NA      NA 232080002 116769997        NA
#> 2        NA      NA   NA      NA      NA 112839996 232080002 116769997
#> 3    32.446      NA   NA      NA      NA 237720001 112839996 232080002
#>      L2.ODA
#> 1        NA
#> 2        NA
#> 3 116769997
#>  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
wlddev |> fselect(country, year, PCGDP:ODA) |>  # Adding 1 lead and 10-year growth rates
  fgroup_by(country) |> fgrowth(c(0:1,10), 1, year) |> head()
#>       country year PCGDP G1.PCGDP L10G1.PCGDP LIFEEX G1.LIFEEX L10G1.LIFEEX
#> 1 Afghanistan 1960    NA       NA          NA 32.446        NA           NA
#> 2 Afghanistan 1961    NA       NA          NA 32.962  1.590335           NA
#> 3 Afghanistan 1962    NA       NA          NA 33.471  1.544202           NA
#> 4 Afghanistan 1963    NA       NA          NA 33.971  1.493830           NA
#> 5 Afghanistan 1964    NA       NA          NA 34.463  1.448294           NA
#>   GINI G1.GINI L10G1.GINI       ODA    G1.ODA L10G1.ODA
#> 1   NA      NA         NA 116769997        NA        NA
#> 2   NA      NA         NA 232080002  98.74969        NA
#> 3   NA      NA         NA 112839996 -51.37884        NA
#> 4   NA      NA         NA 237720001 110.66998        NA
#> 5   NA      NA         NA 295920013  24.48259        NA
#>  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

# etc...

# Aggregation with multiple functions
wlddev |> fsubset(year > 1990, region, income, PCGDP:ODA) |>
  fgroup_by(region, income) %>% {
    add_vars(fgroup_vars(., "unique"),
             fmedian(., keep.group_vars = FALSE) |> add_stub("median_"),
             fmean(., keep.group_vars = FALSE) |> add_stub("mean_"),
             fsd(., keep.group_vars = FALSE) |> add_stub("sd_"))
  } |> head()
#>                  region              income median_PCGDP median_LIFEEX
#> 1   East Asia & Pacific         High income   32573.8177      78.54024
#> 2   East Asia & Pacific Lower middle income    1658.3786      66.07200
#> 3   East Asia & Pacific Upper middle income    3583.2189      70.61500
#> 4 Europe & Central Asia         High income   36201.7707      78.16061
#> 5 Europe & Central Asia          Low income     668.9513      66.08000
#>   median_GINI median_ODA mean_PCGDP mean_LIFEEX mean_GINI  mean_ODA   sd_PCGDP
#> 1       32.75   11500000 32671.0522    78.21996  32.95000 112194118 13031.1867
#> 2       35.70  257079987  1738.5111    65.45647  36.51972 509965862   904.3004
#> 3       40.35   49730000  4575.8695    70.87431  40.64815 219146704  2489.3795
#> 4       31.10  138889999 40814.1215    77.67583  30.94218 321907195 29485.3091
#> 5       32.45  239055000   695.7951    65.04128  32.13333 244539286   242.4158
#>   sd_LIFEEX  sd_GINI    sd_ODA
#> 1  3.825737 1.322624 223580786
#> 2  5.003373 4.779528 686619373
#> 3  3.157915 3.506637 684346804
#> 4  3.810700 3.676878 632730086
#> 5  4.723263 1.713087 116363515
#>  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

# Transformation with multiple functions
wlddev |> fselect(country, year, PCGDP:ODA) |>
  fgroup_by(country) %>% {
    add_vars(fdiff(., c(1,10), 1, year) |> flag(0:2, year),  # Sequence of lagged differences
             ftransform(., fselect(., PCGDP:ODA) |> fwithin() |> add_stub("W.")) |>
               flag(0:2, year, keep.ids = FALSE))            # Sequence of lagged demeaned vars
  } |> head()
#>       country year D1.PCGDP L1.D1.PCGDP L2.D1.PCGDP L10D1.PCGDP L1.L10D1.PCGDP
#> 1 Afghanistan 1960       NA          NA          NA          NA             NA
#>   L2.L10D1.PCGDP D1.LIFEEX L1.D1.LIFEEX L2.D1.LIFEEX L10D1.LIFEEX
#> 1             NA        NA           NA           NA           NA
#>   L1.L10D1.LIFEEX L2.L10D1.LIFEEX D1.GINI L1.D1.GINI L2.D1.GINI L10D1.GINI
#> 1              NA              NA      NA         NA         NA         NA
#>   L1.L10D1.GINI L2.L10D1.GINI D1.ODA L1.D1.ODA L2.D1.ODA L10D1.ODA L1.L10D1.ODA
#> 1            NA            NA     NA        NA        NA        NA           NA
#>   L2.L10D1.ODA PCGDP L1.PCGDP L2.PCGDP LIFEEX L1.LIFEEX L2.LIFEEX GINI L1.GINI
#> 1           NA    NA       NA       NA 32.446        NA        NA   NA      NA
#>   L2.GINI       ODA L1.ODA L2.ODA W.PCGDP L1.W.PCGDP L2.W.PCGDP  W.LIFEEX
#> 1      NA 116769997     NA     NA      NA         NA         NA -16.75117
#>   L1.W.LIFEEX L2.W.LIFEEX W.GINI L1.W.GINI L2.W.GINI       W.ODA L1.W.ODA
#> 1          NA          NA     NA        NA        NA -1370778502       NA
#>   L2.W.ODA
#> 1       NA
#>  [ reached 'max' / getOption("max.print") -- omitted 5 rows ]

# With ftransform, can also easily do one or more grouped mutations on the fly..
settransform(wlddev, median_ODA = fmedian(ODA, list(region, income), TRA = "fill"))

settransform(wlddev, sd_ODA = fsd(ODA, list(region, income), TRA = "fill"),
                     mean_GDP = fmean(PCGDP, country, TRA = "fill"))

wlddev %<>% ftransform(fmedian(list(median_ODA = ODA, median_GDP = PCGDP),
                               list(region, income), TRA = "fill"))

# On a groped data frame it is also possible to grouped transform certain columns
# but perform aggregate operatins on others:
wlddev |> fgroup_by(region, income) %>%
    ftransform(gmedian_GDP = fmedian(PCGDP, GRP(.), TRA = "replace"),
               omedian_GDP = fmedian(PCGDP, TRA = "replace"),  # "replace" preserves NA's
               omedian_GDP_fill = fmedian(PCGDP)) |> tail()
#>        country iso3c       date year decade             region
#> 13171 Zimbabwe   ZWE 2016-01-01 2015   2010 Sub-Saharan Africa
#> 13172 Zimbabwe   ZWE 2017-01-01 2016   2010 Sub-Saharan Africa
#> 13173 Zimbabwe   ZWE 2018-01-01 2017   2010 Sub-Saharan Africa
#>                    income  OECD    PCGDP LIFEEX GINI       ODA      POP
#> 13171 Lower middle income FALSE 1234.103 59.534   NA 817729980 13814629
#> 13172 Lower middle income FALSE 1224.310 60.294   NA 687659973 14030390
#> 13173 Lower middle income FALSE 1263.321 60.812 44.3 753909973 14236745
#>       median_ODA    sd_ODA mean_GDP median_GDP gmedian_GDP omedian_GDP
#> 13171  280630005 694376321 1219.436   1336.053    1336.053    3767.162
#> 13172  280630005 694376321 1219.436   1336.053    1336.053    3767.162
#> 13173  280630005 694376321 1219.436   1336.053    1336.053    3767.162
#>       omedian_GDP_fill
#> 13171         3767.162
#> 13172         3767.162
#> 13173         3767.162
#>  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]

rm(wlddev)

## For multi-type data aggregation, the function collap() offers ease and flexibility
# Aggregate this data by country and decade: Numeric columns with mean, categorical with mode
head(collap(wlddev, ~ country + decade, fmean, fmode))
#>       country iso3c       date   year decade     region     income  OECD
#> 1 Afghanistan   AFG 1961-01-01 1964.5   1960 South Asia Low income FALSE
#> 2 Afghanistan   AFG 1971-01-01 1974.5   1970 South Asia Low income FALSE
#> 3 Afghanistan   AFG 1981-01-01 1984.5   1980 South Asia Low income FALSE
#> 4 Afghanistan   AFG 1991-01-01 1994.5   1990 South Asia Low income FALSE
#> 5 Afghanistan   AFG 2001-01-01 2004.5   2000 South Asia Low income FALSE
#>     PCGDP  LIFEEX GINI        ODA      POP
#> 1      NA 34.6908   NA  222288999  9886773
#> 2      NA 39.9053   NA  236169998 12451803
#> 3      NA 46.4176   NA   71666001 12291854
#> 4      NA 53.0097   NA  317255000 16931903
#> 5 379.373 58.0881   NA 3054051961 24870022
#>  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

# taking weighted mean and weighted mode:
head(collap(wlddev, ~ country + decade, fmean, fmode, w = ~ POP, wFUN = fsum))
#>       country iso3c       date     year decade     region     income  OECD
#> 1 Afghanistan   AFG 1970-01-01 1964.675   1960 South Asia Low income FALSE
#> 2 Afghanistan   AFG 1980-01-01 1974.672   1970 South Asia Low income FALSE
#> 3 Afghanistan   AFG 1981-01-01 1984.364   1980 South Asia Low income FALSE
#> 4 Afghanistan   AFG 2000-01-01 1994.941   1990 South Asia Low income FALSE
#> 5 Afghanistan   AFG 2010-01-01 2004.788   2000 South Asia Low income FALSE
#>      PCGDP   LIFEEX GINI        ODA       POP
#> 1       NA 34.77716   NA  223006447  98867731
#> 2       NA 40.00367   NA  236798314 124518028
#> 3       NA 46.32098   NA   70613923 122918537
#> 4       NA 53.25897   NA  306818649 169319030
#> 5 382.5583 58.23630   NA 3240143310 248700217
#>  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

# Multi-function aggregation of certain columns
head(collap(wlddev, ~ country + decade,
            list(fmean, fmedian, fsd),
            list(ffirst, flast), cols = c(3,9:12)))
#>       country ffirst.date flast.date decade fmean.PCGDP fmedian.PCGDP fsd.PCGDP
#> 1 Afghanistan  1961-01-01 1970-01-01   1960          NA            NA        NA
#> 2 Afghanistan  1971-01-01 1980-01-01   1970          NA            NA        NA
#> 3 Afghanistan  1981-01-01 1990-01-01   1980          NA            NA        NA
#> 4 Afghanistan  1991-01-01 2000-01-01   1990          NA            NA        NA
#>   fmean.LIFEEX fmedian.LIFEEX fsd.LIFEEX fmean.GINI fmedian.GINI fsd.GINI
#> 1      34.6908        34.7055   1.490964         NA           NA       NA
#> 2      39.9053        39.8430   1.738383         NA           NA       NA
#> 3      46.4176        46.4005   2.161460         NA           NA       NA
#> 4      53.0097        53.1200   1.695424         NA           NA       NA
#>   fmean.ODA fmedian.ODA   fsd.ODA
#> 1 222288999   234900002  80884369
#> 2 236169998   246509995  34241008
#> 3  71666001    48539999  72958531
#> 4 317255000   285175003 160500141
#>  [ reached 'max' / getOption("max.print") -- omitted 2 rows ]

# Customized Aggregation: Assign columns to functions
head(collap(wlddev, ~ country + decade,
            custom = list(fmean = 9:10, fsd = 9:12, flast = 3, ffirst = 6:8)))
#>       country flast.date decade ffirst.region ffirst.income ffirst.OECD
#> 1 Afghanistan 1970-01-01   1960    South Asia    Low income       FALSE
#> 2 Afghanistan 1980-01-01   1970    South Asia    Low income       FALSE
#> 3 Afghanistan 1990-01-01   1980    South Asia    Low income       FALSE
#> 4 Afghanistan 2000-01-01   1990    South Asia    Low income       FALSE
#> 5 Afghanistan 2010-01-01   2000    South Asia    Low income       FALSE
#>   fmean.PCGDP fsd.PCGDP fmean.LIFEEX fsd.LIFEEX fsd.GINI    fsd.ODA
#> 1          NA        NA      34.6908   1.490964       NA   80884369
#> 2          NA        NA      39.9053   1.738383       NA   34241008
#> 3          NA        NA      46.4176   2.161460       NA   72958531
#> 4          NA        NA      53.0097   1.695424       NA  160500141
#> 5     379.373  53.66524      58.0881   1.565630       NA 2013110021
#>  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

# For grouped data frames use collapg
wlddev |> fsubset(year > 1990, country, region, income, PCGDP:ODA) |>
  fgroup_by(country) |> collapg(fmean, ffirst) |>
  ftransform(AMGDP = PCGDP > fmedian(PCGDP, list(region, income), TRA = "fill"),
             AMODA = ODA > fmedian(ODA, income, TRA = "replace_fill")) |> head()
#>          country                     region              income      PCGDP
#> 1    Afghanistan                 South Asia          Low income   483.8351
#> 2        Albania      Europe & Central Asia Upper middle income  3127.1510
#> 3        Algeria Middle East & North Africa Upper middle income  4056.4341
#> 4 American Samoa        East Asia & Pacific Upper middle income 10071.0659
#> 5        Andorra      Europe & Central Asia         High income 40768.8453
#> 6         Angola         Sub-Saharan Africa Lower middle income  2876.5065
#>     LIFEEX     GINI        ODA AMGDP AMODA
#> 1 58.32283       NA 2888193791 FALSE  TRUE
#> 2 75.19266 31.41111  343797587 FALSE  TRUE
#> 3 72.57717 31.45000  287459654 FALSE  TRUE
#> 4       NA       NA         NA  TRUE    NA
#> 5       NA       NA         NA  TRUE    NA
#> 6 51.59572 48.66667  412104483  TRUE FALSE

## Additional flexibility for data transformation tasks is offerend by tidy transformation operators
# Within-transformation (centering on overall mean)
head(W(wlddev, ~ country, cols = 9:12, mean = "overall.mean"))
#>       country W.PCGDP W.LIFEEX W.GINI      W.ODA
#> 1 Afghanistan      NA 47.54514     NA -916058371
#> 2 Afghanistan      NA 48.06114     NA -800748366
#> 3 Afghanistan      NA 48.57014     NA -919988371
#> 4 Afghanistan      NA 49.07014     NA -795108366
#> 5 Afghanistan      NA 49.56214     NA -736908354
#> 6 Afghanistan      NA 50.04714     NA -690988371
# Partialling out country and year fixed effects
head(HDW(wlddev, PCGDP + LIFEEX ~ qF(country) + qF(year)))
#>   HDW.PCGDP HDW.LIFEEX
#> 1 1578.6211 -1.3980224
#> 2 1412.8849 -1.1838196
#> 3  917.2033 -1.0547978
#> 4  627.8605 -0.8296048
#> 5  168.0458 -0.6683027
#> 6 -234.9535 -0.4708428
# Same, adding ODA as continuous regressor
head(HDW(wlddev, PCGDP + LIFEEX ~ qF(country) + qF(year) + ODA))
#>   HDW.PCGDP HDW.LIFEEX
#> 1 -324.3991 -1.1765307
#> 2 -439.5404 -0.9751559
#> 3 -598.9266 -0.7835446
#> 4  100.2175 -0.6186010
#> 5  -70.7664 -0.4966332
#> 6  330.3561 -0.2257800
# Standardizing (scaling and centering) by country
head(STD(wlddev, ~ country, cols = 9:12))
#>       country STD.PCGDP STD.LIFEEX STD.GINI    STD.ODA
#> 1 Afghanistan        NA  -1.653181       NA -0.6498451
#> 2 Afghanistan        NA  -1.602256       NA -0.5951801
#> 3 Afghanistan        NA  -1.552023       NA -0.6517082
#> 4 Afghanistan        NA  -1.502678       NA -0.5925063
#> 5 Afghanistan        NA  -1.454122       NA -0.5649154
#> 6 Afghanistan        NA  -1.406257       NA -0.5431461
# Computing 1 lead and 3 lags of the 4 series
head(L(wlddev, -1:3, ~ country, ~year, cols = 9:12))
#>       country year F1.PCGDP PCGDP L1.PCGDP L2.PCGDP L3.PCGDP F1.LIFEEX LIFEEX
#> 1 Afghanistan 1960       NA    NA       NA       NA       NA    32.962 32.446
#> 2 Afghanistan 1961       NA    NA       NA       NA       NA    33.471 32.962
#> 3 Afghanistan 1962       NA    NA       NA       NA       NA    33.971 33.471
#>   L1.LIFEEX L2.LIFEEX L3.LIFEEX F1.GINI GINI L1.GINI L2.GINI L3.GINI    F1.ODA
#> 1        NA        NA        NA      NA   NA      NA      NA      NA 232080002
#> 2    32.446        NA        NA      NA   NA      NA      NA      NA 112839996
#> 3    32.962    32.446        NA      NA   NA      NA      NA      NA 237720001
#>         ODA    L1.ODA    L2.ODA L3.ODA
#> 1 116769997        NA        NA     NA
#> 2 232080002 116769997        NA     NA
#> 3 112839996 232080002 116769997     NA
#>  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
# Computing the 1- and 10-year first differences
head(D(wlddev, c(1,10), 1, ~ country, ~year, cols = 9:12))
#>       country year D1.PCGDP L10D1.PCGDP D1.LIFEEX L10D1.LIFEEX D1.GINI
#> 1 Afghanistan 1960       NA          NA        NA           NA      NA
#> 2 Afghanistan 1961       NA          NA     0.516           NA      NA
#> 3 Afghanistan 1962       NA          NA     0.509           NA      NA
#> 4 Afghanistan 1963       NA          NA     0.500           NA      NA
#> 5 Afghanistan 1964       NA          NA     0.492           NA      NA
#> 6 Afghanistan 1965       NA          NA     0.485           NA      NA
#>   L10D1.GINI     D1.ODA L10D1.ODA
#> 1         NA         NA        NA
#> 2         NA  115310005        NA
#> 3         NA -119240005        NA
#> 4         NA  124880005        NA
#> 5         NA   58200012        NA
#> 6         NA   45919983        NA
head(D(wlddev, c(1,10), 1:2, ~ country, ~year, cols = 9:12))     # ..first and second differences
#>       country year D1.PCGDP D2.PCGDP L10D1.PCGDP L10D2.PCGDP D1.LIFEEX
#> 1 Afghanistan 1960       NA       NA          NA          NA        NA
#> 2 Afghanistan 1961       NA       NA          NA          NA     0.516
#> 3 Afghanistan 1962       NA       NA          NA          NA     0.509
#>   D2.LIFEEX L10D1.LIFEEX L10D2.LIFEEX D1.GINI D2.GINI L10D1.GINI L10D2.GINI
#> 1        NA           NA           NA      NA      NA         NA         NA
#> 2        NA           NA           NA      NA      NA         NA         NA
#> 3    -0.007           NA           NA      NA      NA         NA         NA
#>       D1.ODA     D2.ODA L10D1.ODA L10D2.ODA
#> 1         NA         NA        NA        NA
#> 2  115310005         NA        NA        NA
#> 3 -119240005 -234550011        NA        NA
#>  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
# Computing the 1- and 10-year growth rates
head(G(wlddev, c(1,10), 1, ~ country, ~year, cols = 9:12))
#>       country year G1.PCGDP L10G1.PCGDP G1.LIFEEX L10G1.LIFEEX G1.GINI
#> 1 Afghanistan 1960       NA          NA        NA           NA      NA
#> 2 Afghanistan 1961       NA          NA  1.590335           NA      NA
#> 3 Afghanistan 1962       NA          NA  1.544202           NA      NA
#> 4 Afghanistan 1963       NA          NA  1.493830           NA      NA
#> 5 Afghanistan 1964       NA          NA  1.448294           NA      NA
#> 6 Afghanistan 1965       NA          NA  1.407306           NA      NA
#>   L10G1.GINI    G1.ODA L10G1.ODA
#> 1         NA        NA        NA
#> 2         NA  98.74969        NA
#> 3         NA -51.37884        NA
#> 4         NA 110.66998        NA
#> 5         NA  24.48259        NA
#> 6         NA  15.51770        NA
# Adding growth rate variables to dataset
add_vars(wlddev) <- G(wlddev, c(1, 10), 1, ~ country, ~year, cols = 9:12, keep.ids = FALSE)
get_vars(wlddev, "G1.", regex = TRUE) <- NULL # Deleting again

# These operators can conveniently be used in regression formulas:
# Using a Mundlak (1978) procedure to estimate the effect of OECD on LIFEEX, controlling for PCGDP
lm(LIFEEX ~ log(PCGDP) + OECD + B(log(PCGDP), country),
   wlddev |> fselect(country, OECD, PCGDP, LIFEEX) |> na_omit())
#> 
#> Call:
#> lm(formula = LIFEEX ~ log(PCGDP) + OECD + B(log(PCGDP), country), 
#>     data = na_omit(fselect(wlddev, country, OECD, PCGDP, LIFEEX)))
#> 
#> Coefficients:
#>            (Intercept)              log(PCGDP)                OECDTRUE  
#>               19.32590                 8.20551                 0.02478  
#> B(log(PCGDP), country)  
#>               -2.65428  
#> 

# Adding 10-year lagged life-expectancy to allow for some convergence effects (dynamic panel model)
lm(LIFEEX ~ L(LIFEEX, 10, country) + log(PCGDP) + OECD + B(log(PCGDP), country),
   wlddev |> fselect(country, OECD, PCGDP, LIFEEX) |> na_omit())
#> 
#> Call:
#> lm(formula = LIFEEX ~ L(LIFEEX, 10, country) + log(PCGDP) + OECD + 
#>     B(log(PCGDP), country), data = na_omit(fselect(wlddev, country, 
#>     OECD, PCGDP, LIFEEX)))
#> 
#> Coefficients:
#>            (Intercept)  L(LIFEEX, 10, country)              log(PCGDP)  
#>                 9.2756                  0.8656                  0.9229  
#>               OECDTRUE  B(log(PCGDP), country)  
#>                 0.4158                 -0.6581  
#> 

# Tranformation functions and operators also support indexed data classes:
wldi <- findex_by(wlddev, country, year)
head(W(wldi$PCGDP))                      # Country-demeaning
#> [1] NA NA NA NA NA NA
#> 
#> Indexed by:  country [1] | year [6 (61)] 
head(W(wldi, cols = 9:12))
#>       country year W.PCGDP  W.LIFEEX W.GINI       W.ODA
#> 1 Afghanistan 1960      NA -16.75117     NA -1370778502
#> 2 Afghanistan 1961      NA -16.23517     NA -1255468497
#> 3 Afghanistan 1962      NA -15.72617     NA -1374708502
#> 4 Afghanistan 1963      NA -15.22617     NA -1249828497
#> 5 Afghanistan 1964      NA -14.73417     NA -1191628485
#> 6 Afghanistan 1965      NA -14.24917     NA -1145708502
#> 
#> Indexed by:  country [1] | year [6 (61)] 
head(W(wldi$PCGDP, effect = 2))          # Time-demeaning
#> [1] NA NA NA NA NA NA
#> 
#> Indexed by:  country [1] | year [6 (61)] 
head(W(wldi, effect = 2, cols = 9:12))
#>       country year W.PCGDP  W.LIFEEX W.GINI      W.ODA
#> 1 Afghanistan 1960      NA -21.46606     NA -122241092
#> 2 Afghanistan 1961      NA -21.51241     NA  -37552049
#> 3 Afghanistan 1962      NA -21.38618     NA -183366702
#> 4 Afghanistan 1963      NA -21.23172     NA  -54896550
#> 5 Afghanistan 1964      NA -21.20502     NA   -9633789
#> 6 Afghanistan 1965      NA -21.18163     NA    5438669
#> 
#> Indexed by:  country [1] | year [6 (61)] 
head(HDW(wldi$PCGDP))                    # Country- and time-demeaning
#> [1] NA NA NA NA NA NA
#> 
#> Indexed by:  country [1] | year [6 (61)] 
head(HDW(wldi, cols = 9:12))
#>   HDW.PCGDP HDW.LIFEEX HDW.GINI     HDW.ODA
#> 1        NA  -6.706423       NA -1093922188
#> 2        NA  -6.688440       NA -1032355993
#> 3        NA  -6.562210       NA -1156945288
#> 4        NA  -6.472079       NA -1046169271
#> 5        NA  -6.445378       NA  -996348510
#> 6        NA  -6.367659       NA  -983277444
#> 
#> Indexed by:  country [1] | year [6 (61)] 
head(STD(wldi$PCGDP))                    # Standardizing by country
#> [1] NA NA NA NA NA NA
#> 
#> Indexed by:  country [1] | year [6 (61)] 
head(STD(wldi, cols = 9:12))
#>       country year STD.PCGDP STD.LIFEEX STD.GINI    STD.ODA
#> 1 Afghanistan 1960        NA  -1.653181       NA -0.6498451
#> 2 Afghanistan 1961        NA  -1.602256       NA -0.5951801
#> 3 Afghanistan 1962        NA  -1.552023       NA -0.6517082
#> 4 Afghanistan 1963        NA  -1.502678       NA -0.5925063
#> 5 Afghanistan 1964        NA  -1.454122       NA -0.5649154
#> 6 Afghanistan 1965        NA  -1.406257       NA -0.5431461
#> 
#> Indexed by:  country [1] | year [6 (61)] 
head(L(wldi$PCGDP, -1:3))                # Panel-lags
#>      F1 -- L1 L2 L3
#> [1,] NA NA NA NA NA
#> [2,] NA NA NA NA NA
#> [3,] NA NA NA NA NA
#> [4,] NA NA NA NA NA
#> [5,] NA NA NA NA NA
#> [6,] NA NA NA NA NA
#> attr(,"class")
#> [1] "matrix" "array" 
#> 
#> Indexed by:  country [1] | year [6 (61)] 
head(L(wldi, -1:3, 9:12))
#>       country year F1.PCGDP PCGDP L1.PCGDP L2.PCGDP L3.PCGDP F1.LIFEEX LIFEEX
#> 1 Afghanistan 1960       NA    NA       NA       NA       NA    32.962 32.446
#> 2 Afghanistan 1961       NA    NA       NA       NA       NA    33.471 32.962
#> 3 Afghanistan 1962       NA    NA       NA       NA       NA    33.971 33.471
#>   L1.LIFEEX L2.LIFEEX L3.LIFEEX F1.GINI GINI L1.GINI L2.GINI L3.GINI    F1.ODA
#> 1        NA        NA        NA      NA   NA      NA      NA      NA 232080002
#> 2    32.446        NA        NA      NA   NA      NA      NA      NA 112839996
#> 3    32.962    32.446        NA      NA   NA      NA      NA      NA 237720001
#>         ODA    L1.ODA    L2.ODA L3.ODA
#> 1 116769997        NA        NA     NA
#> 2 232080002 116769997        NA     NA
#> 3 112839996 232080002 116769997     NA
#>  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
#> 
#> Indexed by:  country [1] | year [6 (61)] 
head(G(wldi$PCGDP))                      # Panel-Growth rates
#> [1] NA NA NA NA NA NA
#> 
#> Indexed by:  country [1] | year [6 (61)] 
head(G(wldi, 1, 1, 9:12))
#>       country year G1.PCGDP G1.LIFEEX G1.GINI    G1.ODA
#> 1 Afghanistan 1960       NA        NA      NA        NA
#> 2 Afghanistan 1961       NA  1.590335      NA  98.74969
#> 3 Afghanistan 1962       NA  1.544202      NA -51.37884
#> 4 Afghanistan 1963       NA  1.493830      NA 110.66998
#> 5 Afghanistan 1964       NA  1.448294      NA  24.48259
#> 6 Afghanistan 1965       NA  1.407306      NA  15.51770
#> 
#> Indexed by:  country [1] | year [6 (61)] 

lm(Dlog(PCGDP) ~ L(Dlog(LIFEEX), 0:3), wldi)   # Panel data regression
#> 
#> Call:
#> lm(formula = Dlog(PCGDP) ~ L(Dlog(LIFEEX), 0:3), data = wldi)
#> 
#> Coefficients:
#>            (Intercept)  L(Dlog(LIFEEX), 0:3)--  L(Dlog(LIFEEX), 0:3)L1  
#>                0.01544                -0.12618                 0.38523  
#> L(Dlog(LIFEEX), 0:3)L2  L(Dlog(LIFEEX), 0:3)L3  
#>                0.54179                -0.16475  
#> 
rm(wldi)

# Remove all objects used in this example section
rm(v, d, w, f, f1, f2, g, mtcarsM, sds, series, wlddev)