Advanced and Fast Data Transformation
collapse-package.Rd
collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are:
To facilitate complex data transformation, exploration and computing tasks in R.
To help make R code fast, flexible, parsimonious and programmer friendly.
It is made compatible with the tidyverse, data.table, sf, units, xts/zoo, and the plm approach to panel data.
Getting Started
Read the short vignette on documentation resources, and check out the built in documentation.
Details
collapse provides an integrated suite of statistical and data manipulation functions that greatly extend and enhance the capabilities of base R. In a nutshell, collapse provides:
Fast C/C++ based (grouped, weighted) computations embedded in highly optimized R code.
More complex statistical, time series / panel data and recursive (list-processing) operations.
A flexible and generic approach supporting and preserving many R objects.
Optimized programming in standard and non-standard evaluation.
The statistical functions in collapse are S3 generic with core methods for vectors, matrices and data frames, and internally support grouped and weighted computations carried out in C/C++.
Additional methods and C-level features enable broad based compatibility with dplyr (grouped tibble), data.table, sf and plm panel data classes. Functions and core methods seek to preserve object attributes (including column attributes such as variable labels), ensuring flexibility and effective workflows with a very broad range of R objects (including most time-series classes). See also the vignette on collapse's handling of R objects.
Missing values are efficiently skipped at C/C++ level. The package default is na.rm = TRUE
. This can be changed using set_collapse(na.rm = FALSE)
. Missing weights are generally supported.
collapse installs with a built-in hierarchical documentation facilitating the use of the package.
The package is coded both in C and C++ and built with Rcpp, but also uses C/C++ functions from data.table, kit, fixest, weights, stats and RcppArmadillo / RcppEigen.
Author(s)
Maintainer: Sebastian Krantz sebastian.krantz@graduateinstitute.ch
Other contributors from packages collapse utilizes:
Matt Dowle, Arun Srinivasan and contributors worldwide (data.table)
Dirk Eddelbuettel and contributors worldwide (Rcpp, RcppArmadillo, RcppEigen)
Morgan Jacob (kit)
Laurent Berge (fixest)
Josh Pasek (weights)
R Core Team and contributors worldwide (stats)
I thank many people from diverse fields for helpful answers on Stackoverflow, Joris Meys for encouraging me and helping to set up the GitHub repository for collapse, and many other people for feature requests and helpful suggestions.
Developing / Bug Reporting
Please report issues at https://github.com/SebKrantz/collapse/issues.
Please send pull-requests to the 'development' branch of the repository.
Examples
## Note: this set of examples is is certainly non-exhaustive and does not
## showcase many recent features, but remains a very good starting point
## Let's start with some statistical programming
v <- iris$Sepal.Length
d <- num_vars(iris) # Saving numeric variables
f <- iris$Species # Factor
# Simple statistics
fmean(v) # vector
#> [1] 5.843333
fmean(qM(d)) # matrix (qM is a faster as.matrix)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 5.843333 3.057333 3.758000 1.199333
fmean(d) # data.frame
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 5.843333 3.057333 3.758000 1.199333
# Preserving data structure
fmean(qM(d), drop = FALSE) # Still a matrix
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,] 5.843333 3.057333 3.758 1.199333
fmean(d, drop = FALSE) # Still a data.frame
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.843333 3.057333 3.758 1.199333
# Weighted statistics, supported by most functions...
w <- abs(rnorm(fnrow(iris)))
fmean(d, w = w)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 5.839603 3.086031 3.721517 1.199056
# Grouped statistics...
fmean(d, f)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa 5.006 3.428 1.462 0.246
#> versicolor 5.936 2.770 4.260 1.326
#> virginica 6.588 2.974 5.552 2.026
# Groupwise-weighted statistics...
fmean(d, f, w)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa 5.035771 3.447914 1.453903 0.2653375
#> versicolor 5.930956 2.795360 4.255436 1.3213342
#> virginica 6.578510 2.990917 5.543057 2.0419757
# Simple Transformations...
head(fmode(d, TRA = "replace")) # Replacing values with the mode
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5 3 1.5 0.2
#> 2 5 3 1.5 0.2
#> 3 5 3 1.5 0.2
#> 4 5 3 1.5 0.2
#> 5 5 3 1.5 0.2
#> 6 5 3 1.5 0.2
head(fmedian(d, TRA = "-")) # Subtracting the median
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 -0.7 0.5 -2.95 -1.1
#> 2 -0.9 0.0 -2.95 -1.1
#> 3 -1.1 0.2 -3.05 -1.1
#> 4 -1.2 0.1 -2.85 -1.1
#> 5 -0.8 0.6 -2.95 -1.1
#> 6 -0.4 0.9 -2.65 -0.9
head(fsum(d, TRA = "%")) # Computing percentages
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 0.5818597 0.7631923 0.2483591 0.1111729
#> 2 0.5590416 0.6541648 0.2483591 0.1111729
#> 3 0.5362236 0.6977758 0.2306191 0.1111729
#> 4 0.5248146 0.6759703 0.2660990 0.1111729
#> 5 0.5704507 0.7849978 0.2483591 0.1111729
#> 6 0.6160867 0.8504143 0.3015789 0.2223457
head(fsd(d, TRA = "/")) # Dividing by the standard-deviation (scaling), etc...
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 6.158928 8.029986 0.7930671 0.2623854
#> 2 5.917402 6.882845 0.7930671 0.2623854
#> 3 5.675875 7.341701 0.7364195 0.2623854
#> 4 5.555112 7.112273 0.8497148 0.2623854
#> 5 6.038165 8.259414 0.7930671 0.2623854
#> 6 6.521218 8.947698 0.9630101 0.5247707
# Weighted Transformations...
head(fnth(d, 0.75, w = w, TRA = "replace")) # Replacing by the weighted 3rd quartile
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 6.4 3.4 5.1 1.8
#> 2 6.4 3.4 5.1 1.8
#> 3 6.4 3.4 5.1 1.8
#> 4 6.4 3.4 5.1 1.8
#> 5 6.4 3.4 5.1 1.8
#> 6 6.4 3.4 5.1 1.8
# Grouped Transformations...
head(fvar(d, f, TRA = "replace")) # Replacing values with the group variance
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 0.124249 0.1436898 0.03015918 0.01110612
#> 2 0.124249 0.1436898 0.03015918 0.01110612
#> 3 0.124249 0.1436898 0.03015918 0.01110612
#> 4 0.124249 0.1436898 0.03015918 0.01110612
#> 5 0.124249 0.1436898 0.03015918 0.01110612
#> 6 0.124249 0.1436898 0.03015918 0.01110612
head(fsd(d, f, TRA = "/")) # Grouped scaling
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 14.46851 9.233260 8.061544 1.897793
#> 2 13.90112 7.914223 8.061544 1.897793
#> 3 13.33372 8.441838 7.485720 1.897793
#> 4 13.05003 8.178031 8.637369 1.897793
#> 5 14.18481 9.497068 8.061544 1.897793
#> 6 15.31960 10.288490 9.789018 3.795585
head(fmin(d, f, TRA = "-")) # Setting the minimum value in each species to 0
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 0.8 1.2 0.4 0.1
#> 2 0.6 0.7 0.4 0.1
#> 3 0.4 0.9 0.3 0.1
#> 4 0.3 0.8 0.5 0.1
#> 5 0.7 1.3 0.4 0.1
#> 6 1.1 1.6 0.7 0.3
head(fsum(d, f, TRA = "/")) # Dividing by the sum (proportions)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 0.02037555 0.02042007 0.01915185 0.01626016
#> 2 0.01957651 0.01750292 0.01915185 0.01626016
#> 3 0.01877747 0.01866978 0.01778386 0.01626016
#> 4 0.01837795 0.01808635 0.02051984 0.01626016
#> 5 0.01997603 0.02100350 0.01915185 0.01626016
#> 6 0.02157411 0.02275379 0.02325581 0.03252033
head(fmedian(d, f, TRA = "-")) # Groupwise de-median
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 0.1 0.1 -0.1 0.0
#> 2 -0.1 -0.4 -0.1 0.0
#> 3 -0.3 -0.2 -0.2 0.0
#> 4 -0.4 -0.3 0.0 0.0
#> 5 0.0 0.2 -0.1 0.0
#> 6 0.4 0.5 0.2 0.2
head(ffirst(d, f, TRA = "%%")) # Taking modulus of first group-value, etc. ...
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 0.0 0.0 0.0 0
#> 2 4.9 3.0 0.0 0
#> 3 4.7 3.2 1.3 0
#> 4 4.6 3.1 0.1 0
#> 5 5.0 0.1 0.0 0
#> 6 0.3 0.4 0.3 0
# Grouped and weighted transformations...
head(fsd(d, f, w, "/"), 3) # weighted scaling
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 14.24438 9.301969 8.285370 1.776243
#> 2 13.68577 7.973117 8.285370 1.776243
#> 3 13.12717 8.504658 7.693558 1.776243
head(fmedian(d, f, w, "-"), 3) # subtracting the weighted group-median
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 0.1 0.1 -0.1 0
#> 2 -0.1 -0.4 -0.1 0
#> 3 -0.3 -0.2 -0.2 0
head(fmode(d, f, w, "replace"), 3) # replace with weighted statistical mode
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5 3.4 1.4 0.2
#> 2 5 3.4 1.4 0.2
#> 3 5 3.4 1.4 0.2
## Some more advanced transformations...
head(fbetween(d)) # Averaging (faster t.: fmean(d, TRA = "replace"))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.843333 3.057333 3.758 1.199333
#> 2 5.843333 3.057333 3.758 1.199333
#> 3 5.843333 3.057333 3.758 1.199333
#> 4 5.843333 3.057333 3.758 1.199333
#> 5 5.843333 3.057333 3.758 1.199333
#> 6 5.843333 3.057333 3.758 1.199333
head(fwithin(d)) # Centering (faster than: fmean(d, TRA = "-"))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 -0.7433333 0.44266667 -2.358 -0.9993333
#> 2 -0.9433333 -0.05733333 -2.358 -0.9993333
#> 3 -1.1433333 0.14266667 -2.458 -0.9993333
#> 4 -1.2433333 0.04266667 -2.258 -0.9993333
#> 5 -0.8433333 0.54266667 -2.358 -0.9993333
#> 6 -0.4433333 0.84266667 -2.058 -0.7993333
head(fwithin(d, f, w)) # Grouped and weighted (same as fmean(d, f, w, "-"))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 0.06422856 0.05208646 -0.05390312 -0.06533746
#> 2 -0.13577144 -0.44791354 -0.05390312 -0.06533746
#> 3 -0.33577144 -0.24791354 -0.15390312 -0.06533746
#> 4 -0.43577144 -0.34791354 0.04609688 -0.06533746
#> 5 -0.03577144 0.15208646 -0.05390312 -0.06533746
#> 6 0.36422856 0.45208646 0.24609688 0.13466254
head(fwithin(d, f, w, mean = 5)) # Setting a custom mean
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.064229 5.052086 4.946097 4.934663
#> 2 4.864229 4.552086 4.946097 4.934663
#> 3 4.664229 4.752086 4.846097 4.934663
#> 4 4.564229 4.652086 5.046097 4.934663
#> 5 4.964229 5.152086 4.946097 4.934663
#> 6 5.364229 5.452086 5.246097 5.134663
head(fwithin(d, f, w, theta = 0.76)) # Quasi-centering i.e. d - theta*fbetween(d, f, w)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 1.2728137 0.8795857 0.2950336 -0.00165647
#> 2 1.0728137 0.3795857 0.2950336 -0.00165647
#> 3 0.8728137 0.5795857 0.1950336 -0.00165647
#> 4 0.7728137 0.4795857 0.3950336 -0.00165647
#> 5 1.1728137 0.9795857 0.2950336 -0.00165647
#> 6 1.5728137 1.2795857 0.5950336 0.19834353
head(fwithin(d, f, w, mean = "overall.mean")) # Preserving the overall mean of the data
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.903831 3.138117 3.667614 1.133719
#> 2 5.703831 2.638117 3.667614 1.133719
#> 3 5.503831 2.838117 3.567614 1.133719
#> 4 5.403831 2.738117 3.767614 1.133719
#> 5 5.803831 3.238117 3.667614 1.133719
#> 6 6.203831 3.538117 3.967614 1.333719
head(fscale(d)) # Scaling and centering
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 -0.8976739 1.01560199 -1.335752 -1.311052
#> 2 -1.1392005 -0.13153881 -1.335752 -1.311052
#> 3 -1.3807271 0.32731751 -1.392399 -1.311052
#> 4 -1.5014904 0.09788935 -1.279104 -1.311052
#> 5 -1.0184372 1.24503015 -1.335752 -1.311052
#> 6 -0.5353840 1.93331463 -1.165809 -1.048667
head(fscale(d, mean = 5, sd = 3)) # Custom scaling and centering
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 2.3069784 8.046806 0.9927451 1.066844
#> 2 1.5823985 4.605384 0.9927451 1.066844
#> 3 0.8578187 5.981953 0.8228021 1.066844
#> 4 0.4955288 5.293668 1.1626881 1.066844
#> 5 1.9446885 8.735090 0.9927451 1.066844
#> 6 3.3938481 10.799944 1.5025740 1.854000
head(fscale(d, mean = FALSE, sd = 3)) # Mean preserving scaling
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 3.150312 6.104139 -0.24925490 -2.733823
#> 2 2.425732 2.662717 -0.24925490 -2.733823
#> 3 1.701152 4.039286 -0.41919786 -2.733823
#> 4 1.338862 3.351001 -0.07931195 -2.733823
#> 5 2.788022 6.792424 -0.24925490 -2.733823
#> 6 4.237181 8.857277 0.26057397 -1.946667
head(fscale(d, f, w)) # Grouped and weighted scaling and centering
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 0.17939134 0.1384305 -0.3190052 -0.580276
#> 2 -0.37921170 -1.1904223 -0.3190052 -0.580276
#> 3 -0.93781474 -0.6588812 -0.9108174 -0.580276
#> 4 -1.21711626 -0.9246517 0.2728070 -0.580276
#> 5 -0.09991018 0.4042010 -0.3190052 -0.580276
#> 6 1.01729590 1.2015127 1.4564313 1.195967
head(fscale(d, f, w, mean = 5, sd = 3)) # Custom grouped and weighted scaling and centering
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.538174 5.415291 4.042984 3.259172
#> 2 3.862365 1.428733 4.042984 3.259172
#> 3 2.186556 3.023356 2.267548 3.259172
#> 4 1.348651 2.226045 5.818421 3.259172
#> 5 4.700269 6.212603 4.042984 3.259172
#> 6 8.051888 8.604538 9.369294 8.587901
head(fscale(d, f, w, mean = FALSE, # Preserving group means
sd = "within.sd")) # and setting group-sd to fsd(fwithin(d, f, w), w = w)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.124988 3.495163 1.323723 0.1488517
#> 2 4.847178 3.041600 1.323723 0.1488517
#> 3 4.569368 3.223025 1.082215 0.1488517
#> 4 4.430462 3.132312 1.565231 0.1488517
#> 5 4.986083 3.585875 1.323723 0.1488517
#> 6 5.541704 3.858013 2.048247 0.5054182
head(fscale(d, f, w, mean = "overall.mean", # Full harmonization of group means and variances,
sd = "within.sd")) # while preserving the level and scale of the data.
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.928819 3.133280 3.591337 1.082571
#> 2 5.651009 2.679717 3.591337 1.082571
#> 3 5.373199 2.861142 3.349829 1.082571
#> 4 5.234294 2.770429 3.832845 1.082571
#> 5 5.789914 3.223992 3.591337 1.082571
#> 6 6.345535 3.496130 4.315861 1.439137
head(get_vars(iris, 1:2)) # Use get_vars for fast selecting, gv is shortcut
#> Sepal.Length Sepal.Width
#> 1 5.1 3.5
#> 2 4.9 3.0
#> 3 4.7 3.2
#> 4 4.6 3.1
#> 5 5.0 3.6
#> 6 5.4 3.9
head(fhdbetween(gv(iris, 1:2), gv(iris, 3:5))) # Linear prediction with factors and covariates
#> Sepal.Length Sepal.Width
#> 1 4.950107 3.389732
#> 2 4.950107 3.389732
#> 3 4.859513 3.374264
#> 4 5.040702 3.405199
#> 5 4.950107 3.389732
#> 6 5.220692 3.560823
head(fhdwithin(gv(iris, 1:2), gv(iris, 3:5))) # Linear partialling out factors and covariates
#> Sepal.Length Sepal.Width
#> 1 0.14989286 0.1102684
#> 2 -0.05010714 -0.3897316
#> 3 -0.15951256 -0.1742640
#> 4 -0.44070173 -0.3051992
#> 5 0.04989286 0.2102684
#> 6 0.17930818 0.3391766
ss(iris, 1:10, 1:2) # Similarly fsubset/ss for fast subsetting rows
#> Sepal.Length Sepal.Width
#> 1 5.1 3.5
#> 2 4.9 3.0
#> 3 4.7 3.2
#> 4 4.6 3.1
#> 5 5.0 3.6
#> 6 5.4 3.9
#> 7 4.6 3.4
#> 8 5.0 3.4
#> 9 4.4 2.9
#> 10 4.9 3.1
# Simple Time-Computations..
head(flag(AirPassengers, -1:3)) # One lead and three lags
#> F1 -- L1 L2 L3
#> [1,] 118 112 NA NA NA
#> [2,] 132 118 112 NA NA
#> [3,] 129 132 118 112 NA
#> [4,] 121 129 132 118 112
#> [5,] 135 121 129 132 118
#> [6,] 148 135 121 129 132
head(fdiff(EuStockMarkets, # Suitably lagged first and second differences
c(1, frequency(EuStockMarkets)), diff = 1:2))
#> D1.DAX D2.DAX L260D1.DAX L260D2.DAX D1.SMI D2.SMI L260D1.SMI L260D2.SMI
#> [1,] NA NA NA NA NA NA NA NA
#> [2,] -15.12 NA NA NA 10.4 NA NA NA
#> [3,] -7.12 8.00 NA NA -9.9 -20.3 NA NA
#> [4,] 14.53 21.65 NA NA 5.5 15.4 NA NA
#> D1.CAC D2.CAC L260D1.CAC L260D2.CAC D1.FTSE D2.FTSE L260D1.FTSE
#> [1,] NA NA NA NA NA NA NA
#> [2,] -22.3 NA NA NA 16.6 NA NA
#> [3,] -32.5 -10.2 NA NA -12.0 -28.6 NA
#> [4,] -9.9 22.6 NA NA 22.2 34.2 NA
#> L260D2.FTSE
#> [1,] NA
#> [2,] NA
#> [3,] NA
#> [4,] NA
#> [ reached getOption("max.print") -- omitted 2 rows ]
head(fdiff(EuStockMarkets, rho = 0.87)) # Quasi-differences (x_t - rho*x_t-1)
#> DAX SMI CAC FTSE
#> [1,] NA NA NA NA
#> [2,] 196.6175 228.553 208.164 334.268
#> [3,] 202.6519 209.605 195.065 307.826
#> [4,] 223.3763 223.718 213.440 340.466
#> [5,] 207.8552 221.433 237.053 335.452
#> [6,] 202.8108 204.258 215.203 305.111
head(fdiff(EuStockMarkets, log = TRUE)) # Log-differences
#> DAX SMI CAC FTSE
#> [1,] NA NA NA NA
#> [2,] -0.009326550 0.006178360 -0.012658756 0.006770286
#> [3,] -0.004422175 -0.005880448 -0.018740638 -0.004889587
#> [4,] 0.009003794 0.003271184 -0.005779182 0.009027020
#> [5,] -0.001778217 0.001483372 0.008743353 0.005771847
#> [6,] -0.004676712 -0.008933417 -0.005120160 -0.007230164
head(fgrowth(EuStockMarkets)) # Exact growth rates (percentage change)
#> DAX SMI CAC FTSE
#> [1,] NA NA NA NA
#> [2,] -0.9283193 0.6197485 -1.2578971 0.6793256
#> [3,] -0.4412412 -0.5863192 -1.8566124 -0.4877652
#> [4,] 0.9044450 0.3276540 -0.5762515 0.9067887
#> [5,] -0.1776637 0.1484472 0.8781687 0.5788536
#> [6,] -0.4665793 -0.8893632 -0.5107074 -0.7204089
head(fgrowth(EuStockMarkets, logdiff = TRUE)) # Log-difference growth rates (percentage change)
#> DAX SMI CAC FTSE
#> [1,] NA NA NA NA
#> [2,] -0.9326550 0.6178360 -1.2658756 0.6770286
#> [3,] -0.4422175 -0.5880448 -1.8740638 -0.4889587
#> [4,] 0.9003794 0.3271184 -0.5779182 0.9027020
#> [5,] -0.1778217 0.1483372 0.8743353 0.5771847
#> [6,] -0.4676712 -0.8933417 -0.5120160 -0.7230164
# Note that it is not necessary to use factors for grouping.
fmean(gv(mtcars, -c(2,8:9)), mtcars$cyl) # Can also use vector (internally converted using qF())
#> mpg disp hp drat wt qsec gear carb
#> 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727 4.090909 1.545455
#> 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 3.857143 3.428571
#> 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 3.285714 3.500000
fmean(gv(mtcars, -c(2,8:9)),
gv(mtcars, c(2,8:9))) # or a list of vector (internally grouped using GRP())
#> mpg disp hp drat wt qsec gear carb
#> 4.0.1 26.00000 120.3000 91.00000 4.430000 2.140000 16.70000 5.000000 2.000000
#> 4.1.0 22.90000 135.8667 84.66667 3.770000 2.935000 20.97000 3.666667 1.666667
#> 4.1.1 28.37143 89.8000 80.57143 4.148571 2.028286 18.70000 4.142857 1.428571
#> 6.0.1 20.56667 155.0000 131.66667 3.806667 2.755000 16.32667 4.333333 4.666667
#> 6.1.0 19.12500 204.5500 115.25000 3.420000 3.388750 19.21500 3.500000 2.500000
#> 8.0.0 15.05000 357.6167 194.16667 3.120833 4.104083 17.14250 3.000000 3.083333
#> 8.0.1 15.40000 326.0000 299.50000 3.880000 3.370000 14.55000 5.000000 6.000000
g <- GRP(mtcars, ~ cyl + vs + am) # It is also possible to create grouping objects
print(g) # These are instructive to learn about the grouping,
#> collapse grouping object of length 32 with 7 ordered groups
#>
#> Call: GRP.default(X = mtcars, by = ~cyl + vs + am), X is unsorted
#>
#> Distribution of group sizes:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.000 2.500 3.000 4.571 5.500 12.000
#>
#> Groups with sizes:
#> 4.0.1 4.1.0 4.1.1 6.0.1 6.1.0 8.0.0 8.0.1
#> 1 3 7 3 4 12 2
plot(g) # and are directly handed down to C++ code
fmean(gv(mtcars, -c(2,8:9)), g) # This can speed up multiple computations over same groups
#> mpg disp hp drat wt qsec gear carb
#> 4.0.1 26.00000 120.3000 91.00000 4.430000 2.140000 16.70000 5.000000 2.000000
#> 4.1.0 22.90000 135.8667 84.66667 3.770000 2.935000 20.97000 3.666667 1.666667
#> 4.1.1 28.37143 89.8000 80.57143 4.148571 2.028286 18.70000 4.142857 1.428571
#> 6.0.1 20.56667 155.0000 131.66667 3.806667 2.755000 16.32667 4.333333 4.666667
#> 6.1.0 19.12500 204.5500 115.25000 3.420000 3.388750 19.21500 3.500000 2.500000
#> 8.0.0 15.05000 357.6167 194.16667 3.120833 4.104083 17.14250 3.000000 3.083333
#> 8.0.1 15.40000 326.0000 299.50000 3.880000 3.370000 14.55000 5.000000 6.000000
fsd(gv(mtcars, -c(2,8:9)), g)
#> mpg disp hp drat wt qsec gear
#> 4.0.1 NA NA NA NA NA NA NA
#> 4.1.0 1.4525839 13.969371 19.65536 0.1300000 0.4075230 1.67143651 0.5773503
#> 4.1.1 4.7577005 18.802128 24.14441 0.3783926 0.4400840 0.94546285 0.3779645
#> 6.0.1 0.7505553 8.660254 37.52777 0.1616581 0.1281601 0.76872188 0.5773503
#> 6.1.0 1.6317169 44.742634 9.17878 0.5919459 0.1162164 0.81590441 0.5773503
#> 8.0.0 2.7743959 71.823494 33.35984 0.2302749 0.7683069 0.80164745 0.0000000
#> 8.0.1 0.5656854 35.355339 50.20458 0.4808326 0.2828427 0.07071068 0.0000000
#> carb
#> 4.0.1 NA
#> 4.1.0 0.5773503
#> 4.1.1 0.5345225
#> 6.0.1 1.1547005
#> 6.1.0 1.7320508
#> 8.0.0 0.9003366
#> 8.0.1 2.8284271
# Factors can efficiently be created using qF()
f1 <- qF(mtcars$cyl) # Unlike GRP objects, factors are checked for NA's
f2 <- qF(mtcars$cyl, na.exclude = FALSE) # This can however be avoided through this option
class(f2) # Note the added class
#> [1] "factor" "na.included"
library(microbenchmark)
microbenchmark(fmean(mtcars, f1), fmean(mtcars, f2)) # A minor difference, larger on larger data
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> fmean(mtcars, f1) 3.813 4.059 4.72156 4.346 4.6125 35.547 100
#> fmean(mtcars, f2) 3.690 3.895 4.18446 4.100 4.4280 5.330 100
with(mtcars, finteraction(cyl, vs, am)) # Efficient interactions of vectors and/or factors
#> [1] 6.0.1 6.0.1 4.1.1 6.1.0 8.0.0 6.1.0 8.0.0 4.1.0 4.1.0 6.1.0 6.1.0 8.0.0
#> [13] 8.0.0 8.0.0 8.0.0 8.0.0 8.0.0 4.1.1 4.1.1 4.1.1 4.1.0 8.0.0 8.0.0 8.0.0
#> [25] 8.0.0 4.1.1 4.0.1 4.1.1 8.0.1 6.0.1 8.0.1 4.1.1
#> Levels: 4.0.1 4.1.0 4.1.1 6.0.1 6.1.0 8.0.0 8.0.1
finteraction(gv(mtcars, c(2,8:9))) # .. or lists of vectors/factors
#> [1] 6.0.1 6.0.1 4.1.1 6.1.0 8.0.0 6.1.0 8.0.0 4.1.0 4.1.0 6.1.0 6.1.0 8.0.0
#> [13] 8.0.0 8.0.0 8.0.0 8.0.0 8.0.0 4.1.1 4.1.1 4.1.1 4.1.0 8.0.0 8.0.0 8.0.0
#> [25] 8.0.0 4.1.1 4.0.1 4.1.1 8.0.1 6.0.1 8.0.1 4.1.1
#> Levels: 4.0.1 4.1.0 4.1.1 6.0.1 6.1.0 8.0.0 8.0.1
# Simple row- or column-wise computations on matrices or data frames with dapply()
dapply(mtcars, quantile) # column quantiles
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 0% 10.400 4 71.100 52.0 2.760 1.51300 14.5000 0 0 3 1
#> 25% 15.425 4 120.825 96.5 3.080 2.58125 16.8925 0 0 3 2
#> 50% 19.200 6 196.300 123.0 3.695 3.32500 17.7100 0 0 4 2
#> 75% 22.800 8 326.000 180.0 3.920 3.61000 18.9000 1 1 4 4
#> 100% 33.900 8 472.000 335.0 4.930 5.42400 22.9000 1 1 5 8
dapply(mtcars, quantile, MARGIN = 1) # Row-quantiles
#> 0% 25% 50% 75% 100%
#> Mazda RX4 0 3.2600 4.000 18.730 160.0
#> Mazda RX4 Wag 0 3.3875 4.000 19.010 160.0
#> Datsun 710 1 1.6600 4.000 20.705 108.0
#> Hornet 4 Drive 0 2.0000 3.215 20.420 258.0
#> Hornet Sportabout 0 2.5000 3.440 17.860 360.0
#> Valiant 0 1.8800 3.460 19.160 225.0
#> Duster 360 0 3.1050 4.000 15.070 360.0
#> Merc 240D 0 2.5950 4.000 22.200 146.7
#> Merc 230 0 2.5750 4.000 22.850 140.8
#> Merc 280 0 3.6800 4.000 18.750 167.6
#> Merc 280C 0 3.6800 4.000 18.350 167.6
#> Merc 450SE 0 3.0000 4.070 16.900 275.8
#> Merc 450SL 0 3.0000 3.730 17.450 275.8
#> Merc 450SLC 0 3.0000 3.780 16.600 275.8
#> [ reached 'max' / getOption("max.print") -- omitted 18 rows ]
# dapply preserves the data structure of any matrices / data frames passed
# Some fast matrix row/column functions are also provided by the matrixStats package
# Similarly, BY performs grouped comptations
BY(mtcars, f2, quantile)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 4.0% 21.4 4 71.10 52.0 3.690 1.5130 16.70 0 0.0 3 1
#> 4.25% 22.8 4 78.85 65.5 3.810 1.8850 18.56 1 0.5 4 1
#> 4.50% 26.0 4 108.00 91.0 4.080 2.2000 18.90 1 1.0 4 2
#> 4.75% 30.4 4 120.65 96.0 4.165 2.6225 19.95 1 1.0 4 2
#> 4.100% 33.9 4 146.70 113.0 4.930 3.1900 22.90 1 1.0 5 2
#> 6.0% 17.8 6 145.00 105.0 2.760 2.6200 15.50 0 0.0 3 1
#> [ reached 'max' / getOption("max.print") -- omitted 9 rows ]
BY(mtcars, f2, quantile, expand.wide = TRUE)
#> mpg.0% mpg.25% mpg.50% mpg.75% mpg.100% cyl.0% cyl.25% cyl.50% cyl.75%
#> 4 21.4 22.8 26 30.4 33.9 4 4 4 4
#> cyl.100% disp.0% disp.25% disp.50% disp.75% disp.100% hp.0% hp.25% hp.50%
#> 4 4 71.1 78.85 108 120.65 146.7 52 65.5 91
#> hp.75% hp.100% drat.0% drat.25% drat.50% drat.75% drat.100% wt.0% wt.25%
#> 4 96 113 3.69 3.81 4.08 4.165 4.93 1.513 1.885
#> wt.50% wt.75% wt.100% qsec.0% qsec.25% qsec.50% qsec.75% qsec.100% vs.0%
#> 4 2.2 2.6225 3.19 16.7 18.56 18.9 19.95 22.9 0
#> vs.25% vs.50% vs.75% vs.100% am.0% am.25% am.50% am.75% am.100% gear.0%
#> 4 1 1 1 1 0 0.5 1 1 1 3
#> gear.25% gear.50% gear.75% gear.100% carb.0% carb.25% carb.50% carb.75%
#> 4 4 4 4 5 1 1 2 2
#> carb.100%
#> 4 2
#> [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
# For efficient (grouped) replacing and sweeping out computed statistics, use TRA()
sds <- fsd(mtcars)
head(TRA(mtcars, sds, "/")) # Simple scaling (if sd's not needed, use fsd(mtcars, TRA = "/"))
#> mpg cyl disp hp drat wt
#> Mazda RX4 3.484351 3.35961 1.2909608 1.604367 7.294100 2.677684
#> Mazda RX4 Wag 3.484351 3.35961 1.2909608 1.604367 7.294100 2.938298
#> Datsun 710 3.783009 2.23974 0.8713986 1.356419 7.200586 2.371079
#> Hornet 4 Drive 3.550719 3.35961 2.0816744 1.604367 5.760468 3.285784
#> Hornet Sportabout 3.102731 4.47948 2.9046619 2.552402 5.891388 3.515738
#> Valiant 3.003178 3.35961 1.8154137 1.531441 5.161978 3.536178
#> qsec vs am gear carb
#> Mazda RX4 9.211261 0.000000 2.004044 5.421494 2.4764735
#> Mazda RX4 Wag 9.524645 0.000000 2.004044 5.421494 2.4764735
#> Datsun 710 10.414433 1.984063 2.004044 5.421494 0.6191184
#> Hornet 4 Drive 10.878913 1.984063 0.000000 4.066120 0.6191184
#> Hornet Sportabout 9.524645 0.000000 0.000000 4.066120 1.2382368
#> Valiant 11.315413 1.984063 0.000000 4.066120 0.6191184
microbenchmark(TRA(mtcars, sds, "/"), sweep(mtcars, 2, sds, "/")) # A remarkable performance gain..
#> Unit: microseconds
#> expr min lq mean median uq max
#> TRA(mtcars, sds, "/") 2.050 2.4805 3.13527 2.9315 3.362 14.596
#> sweep(mtcars, 2, sds, "/") 300.817 310.8825 325.87989 320.7840 334.437 551.450
#> neval
#> 100
#> 100
sds <- fsd(mtcars, f2)
head(TRA(mtcars, sds, "/", f2)) # Groupd scaling (if sd's not needed: fsd(mtcars, f2, TRA = "/"))
#> mpg cyl disp hp drat wt qsec
#> Mazda RX4 14.447218 Inf 3.849628 4.534121 8.192327 7.352414 9.643407
#> Mazda RX4 Wag 14.447218 Inf 3.849628 4.534121 8.192327 8.068012 9.971493
#> Datsun 710 5.055626 Inf 4.019114 4.442421 10.534350 4.073293 11.061282
#> Hornet 4 Drive 14.722403 Inf 6.207525 4.534121 6.469838 9.022142 11.389297
#> Hornet Sportabout 7.304550 Inf 5.311981 3.432928 8.459515 4.529864 14.230606
#> Valiant 12.452126 Inf 5.413539 4.328025 5.797647 9.709677 11.846275
#> vs am gear carb
#> Mazda RX4 0.000000 1.870829 5.796551 2.2067091
#> Mazda RX4 Wag 0.000000 1.870829 5.796551 2.2067091
#> Datsun 710 3.316625 2.140872 7.416198 1.9148542
#> Hornet 4 Drive 1.870829 0.000000 4.347413 0.5516773
#> Hornet Sportabout NaN 0.000000 4.130678 1.2848321
#> Valiant 1.870829 0.000000 4.347413 0.5516773
# All functions above perserve the structure of matrices / data frames
# If conversions are required, use these efficient functions:
mtcarsM <- qM(mtcars) # Matrix from data.frame
head(qDF(mtcarsM)) # data.frame from matrix columns
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
head(mrtl(mtcarsM, TRUE, "data.frame")) # data.frame from matrix rows, etc..
#> Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant
#> mpg 21 21 22.8 21.4 18.7 18.1
#> cyl 6 6 4.0 6.0 8.0 6.0
#> Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL
#> mpg 14.3 24.4 22.8 19.2 17.8 16.4 17.3
#> cyl 8.0 4.0 4.0 6.0 6.0 8.0 8.0
#> Merc 450SLC Cadillac Fleetwood Lincoln Continental Chrysler Imperial
#> mpg 15.2 10.4 10.4 14.7
#> cyl 8.0 8.0 8.0 8.0
#> Fiat 128 Honda Civic Toyota Corolla Toyota Corona Dodge Challenger
#> mpg 32.4 30.4 33.9 21.5 15.5
#> cyl 4.0 4.0 4.0 4.0 8.0
#> AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2
#> mpg 15.2 13.3 19.2 27.3 26
#> cyl 8.0 8.0 8.0 4.0 4
#> Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> mpg 30.4 15.8 19.7 15 21.4
#> cyl 4.0 8.0 6.0 8 4.0
#> [ reached 'max' / getOption("max.print") -- omitted 4 rows ]
head(qDT(mtcarsM, "cars")) # Saving row.names when converting matrix to data.table
#> cars mpg cyl disp hp drat wt qsec vs am
#> <char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
#> 2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
#> 3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
#> 4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
#> gear carb
#> <num> <num>
#> 1: 4 4
#> 2: 4 4
#> 3: 4 1
#> 4: 3 1
#> [ reached getOption("max.print") -- omitted 2 rows ]
head(qDT(mtcars, "cars")) # Same use a data.frame
#> cars mpg cyl disp hp drat wt qsec vs am
#> <char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
#> 2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
#> 3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
#> 4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
#> gear carb
#> <num> <num>
#> 1: 4 4
#> 2: 4 4
#> 3: 4 1
#> 4: 3 1
#> [ reached getOption("max.print") -- omitted 2 rows ]
## Now let's get some real data and see how we can use this power for data manipulation
head(wlddev) # World Bank World Development Data: 216 countries, 61 years, 5 series (columns 9-13)
#> country iso3c date year decade region income OECD PCGDP
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA
#> 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA
#> 5 Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA
#> LIFEEX GINI ODA POP
#> 1 32.446 NA 116769997 8996973
#> 2 32.962 NA 232080002 9169410
#> 3 33.471 NA 112839996 9351441
#> 4 33.971 NA 237720001 9543205
#> 5 34.463 NA 295920013 9744781
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
# Starting with some discriptive tools...
namlab(wlddev, class = TRUE) # Show variable names, labels and classes
#> Variable Class
#> 1 country character
#> 2 iso3c factor
#> 3 date Date
#> 4 year integer
#> 5 decade integer
#> 6 region factor
#> 7 income factor
#> 8 OECD logical
#> 9 PCGDP numeric
#> 10 LIFEEX numeric
#> 11 GINI numeric
#> 12 ODA numeric
#> 13 POP numeric
#> Label
#> 1 Country Name
#> 2 Country Code
#> 3 Date Recorded (Fictitious)
#> 4 Year
#> 5 Decade
#> 6 Region
#> 7 Income Level
#> 8 Is OECD Member Country?
#> 9 GDP per capita (constant 2010 US$)
#> 10 Life expectancy at birth, total (years)
#> 11 Gini index (World Bank estimate)
#> 12 Net official development assistance and official aid received (constant 2018 US$)
#> 13 Population, total
fnobs(wlddev) # Observation count
#> country iso3c date year decade region income OECD PCGDP LIFEEX
#> 13176 13176 13176 13176 13176 13176 13176 13176 9470 11670
#> GINI ODA POP
#> 1744 8608 12919
pwnobs(wlddev) # Pairwise observation count
#> country iso3c date year decade region income OECD PCGDP LIFEEX GINI
#> country 13176 13176 13176 13176 13176 13176 13176 13176 9470 11670 1744
#> iso3c 13176 13176 13176 13176 13176 13176 13176 13176 9470 11670 1744
#> date 13176 13176 13176 13176 13176 13176 13176 13176 9470 11670 1744
#> year 13176 13176 13176 13176 13176 13176 13176 13176 9470 11670 1744
#> decade 13176 13176 13176 13176 13176 13176 13176 13176 9470 11670 1744
#> ODA POP
#> country 8608 12919
#> iso3c 8608 12919
#> date 8608 12919
#> year 8608 12919
#> decade 8608 12919
#> [ reached getOption("max.print") -- omitted 8 rows ]
head(fnobs(wlddev, wlddev$country)) # Grouped observation count
#> country iso3c date year decade region income OECD PCGDP LIFEEX
#> Afghanistan 61 61 61 61 61 61 61 61 18 60
#> Albania 61 61 61 61 61 61 61 61 40 60
#> Algeria 61 61 61 61 61 61 61 61 60 60
#> American Samoa 61 61 61 61 61 61 61 61 17 0
#> Andorra 61 61 61 61 61 61 61 61 50 0
#> GINI ODA POP
#> Afghanistan 0 60 60
#> Albania 9 32 60
#> Algeria 3 60 60
#> American Samoa 0 0 60
#> Andorra 0 0 60
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
fndistinct(wlddev) # Distinct values
#> country iso3c date year decade region income OECD PCGDP LIFEEX
#> 216 216 61 61 7 7 4 2 9470 10548
#> GINI ODA POP
#> 368 7832 12877
descr(wlddev) # Describe data
#> Dataset: wlddev, 13 Variables, N = 13176
#> --------------------------------------------------------------------------------
#> country (character): Country Name
#> Statistics
#> N Ndist
#> 13176 216
#> Table
#> Freq Perc
#> Afghanistan 61 0.46
#> Albania 61 0.46
#> Algeria 61 0.46
#> American Samoa 61 0.46
#> Andorra 61 0.46
#> Angola 61 0.46
#> Antigua and Barbuda 61 0.46
#> Argentina 61 0.46
#> Armenia 61 0.46
#> Aruba 61 0.46
#> Australia 61 0.46
#> Austria 61 0.46
#> Azerbaijan 61 0.46
#> Bahamas, The 61 0.46
#> ... 202 Others 12322 93.52
#>
#> Summary of Table Frequencies
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 61 61 61 61 61 61
#> --------------------------------------------------------------------------------
#> iso3c (factor): Country Code
#> Statistics
#> N Ndist
#> 13176 216
#> Table
#> Freq Perc
#> ABW 61 0.46
#> AFG 61 0.46
#> AGO 61 0.46
#> ALB 61 0.46
#> AND 61 0.46
#> ARE 61 0.46
#> ARG 61 0.46
#> ARM 61 0.46
#> ASM 61 0.46
#> ATG 61 0.46
#> AUS 61 0.46
#> AUT 61 0.46
#> AZE 61 0.46
#> BDI 61 0.46
#> ... 202 Others 12322 93.52
#>
#> Summary of Table Frequencies
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 61 61 61 61 61 61
#> --------------------------------------------------------------------------------
#> date (Date): Date Recorded (Fictitious)
#> Statistics
#> N Ndist Min Max
#> 13176 61 1961-01-01 2021-01-01
#> --------------------------------------------------------------------------------
#> year (integer): Year
#> Statistics
#> N Ndist Mean SD Min Max Skew Kurt
#> 13176 61 1990 17.61 1960 2020 -0 1.8
#> Quantiles
#> 1% 5% 10% 25% 50% 75% 90% 95% 99%
#> 1960 1963 1966 1975 1990 2005 2014 2017 2020
#> --------------------------------------------------------------------------------
#> decade (integer): Decade
#> Statistics
#> N Ndist Mean SD Min Max Skew Kurt
#> 13176 7 1985.57 17.51 1960 2020 0.03 1.79
#> Quantiles
#> 1% 5% 10% 25% 50% 75% 90% 95% 99%
#> 1960 1960 1960 1970 1990 2000 2010 2010 2020
#> --------------------------------------------------------------------------------
#> region (factor): Region
#> Statistics
#> N Ndist
#> 13176 7
#> Table
#> Freq Perc
#> Europe & Central Asia 3538 26.85
#> Sub-Saharan Africa 2928 22.22
#> Latin America & Caribbean 2562 19.44
#> East Asia & Pacific 2196 16.67
#> Middle East & North Africa 1281 9.72
#> South Asia 488 3.70
#> North America 183 1.39
#> --------------------------------------------------------------------------------
#> income (factor): Income Level
#> Statistics
#> N Ndist
#> 13176 4
#> Table
#> Freq Perc
#> High income 4819 36.57
#> Upper middle income 3660 27.78
#> Lower middle income 2867 21.76
#> Low income 1830 13.89
#> --------------------------------------------------------------------------------
#> OECD (logical): Is OECD Member Country?
#> Statistics
#> N Ndist
#> 13176 2
#> Table
#> Freq Perc
#> FALSE 10980 83.33
#> TRUE 2196 16.67
#> --------------------------------------------------------------------------------
#> PCGDP (numeric): GDP per capita (constant 2010 US$)
#> Statistics (28.13% NAs)
#> N Ndist Mean SD Min Max Skew Kurt
#> 9470 9470 12048.78 19077.64 132.08 196061.42 3.13 17.12
#> Quantiles
#> 1% 5% 10% 25% 50% 75% 90% 95%
#> 227.71 399.62 555.55 1303.19 3767.16 14787.03 35646.02 48507.84
#> 99%
#> 92340.28
#> --------------------------------------------------------------------------------
#> LIFEEX (numeric): Life expectancy at birth, total (years)
#> Statistics (11.43% NAs)
#> N Ndist Mean SD Min Max Skew Kurt
#> 11670 10548 64.3 11.48 18.91 85.42 -0.67 2.67
#> Quantiles
#> 1% 5% 10% 25% 50% 75% 90% 95% 99%
#> 35.83 42.77 46.83 56.36 67.44 72.95 77.08 79.34 82.36
#> --------------------------------------------------------------------------------
#> GINI (numeric): Gini index (World Bank estimate)
#> Statistics (86.76% NAs)
#> N Ndist Mean SD Min Max Skew Kurt
#> 1744 368 38.53 9.2 20.7 65.8 0.6 2.53
#> Quantiles
#> 1% 5% 10% 25% 50% 75% 90% 95% 99%
#> 24.6 26.3 27.6 31.5 36.4 45 52.6 55.98 60.5
#> --------------------------------------------------------------------------------
#> ODA (numeric): Net official development assistance and official aid received (constant 2018 US$)
#> Statistics (34.67% NAs)
#> N Ndist Mean SD Min Max Skew
#> 8608 7832 454'720131 868'712654 -997'679993 2.56715605e+10 6.98
#> Kurt
#> 114.89
#> Quantiles
#> 1% 5% 10% 25% 50% 75%
#> -12'593999.7 1'363500.01 8'347000.31 44'887499.8 165'970001 495'042503
#> 90% 95% 99%
#> 1.18400697e+09 1.93281696e+09 3.73380782e+09
#> --------------------------------------------------------------------------------
#> POP (numeric): Population, total
#> Statistics (1.95% NAs)
#> N Ndist Mean SD Min Max Skew Kurt
#> 12919 12877 24'245971.6 102'120674 2833 1.39771500e+09 9.75 108.91
#> Quantiles
#> 1% 5% 10% 25% 50% 75% 90%
#> 8698.84 31083.3 62268.4 443791 4'072517 12'816178 46'637331.4
#> 95% 99%
#> 81'177252.5 308'862641
#> --------------------------------------------------------------------------------
varying(wlddev, ~ country) # Show which variables vary within countries
#> iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
#> FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
#> POP
#> TRUE
qsu(wlddev, pid = ~ country, # Panel-summarize columns 9 though 12 of this data
cols = 9:12, vlabels = TRUE) # (between and within countries)
#> , , PCGDP: GDP per capita (constant 2010 US$)
#>
#> N/T Mean SD Min Max
#> Overall 9470 12048.778 19077.6416 132.0776 196061.417
#> Between 206 12962.6054 20189.9007 253.1886 141200.38
#> Within 45.9709 12048.778 6723.6808 -33504.8721 76767.5254
#>
#> , , LIFEEX: Life expectancy at birth, total (years)
#>
#> N/T Mean SD Min Max
#> Overall 11670 64.2963 11.4764 18.907 85.4171
#> Between 207 64.9537 9.8936 40.9663 85.4171
#> Within 56.3768 64.2963 6.0842 32.9068 84.4198
#>
#> , , GINI: Gini index (World Bank estimate)
#>
#> N/T Mean SD Min Max
#> Overall 1744 38.5341 9.2006 20.7 65.8
#> Between 167 39.4233 8.1356 24.8667 61.7143
#> Within 10.4431 38.5341 2.9277 25.3917 55.3591
#>
#> , , ODA: Net official development assistance and official aid received (constant 2018 US$)
#>
#> N/T Mean SD Min Max
#> Overall 8608 454'720131 868'712654 -997'679993 2.56715605e+10
#> Between 178 439'168412 569'049959 468717.916 3.62337432e+09
#> Within 48.3596 454'720131 650'709624 -2.44379420e+09 2.45610972e+10
#>
qsu(wlddev, ~ region, ~ country, # Do all of that by region and also compute higher moments
cols = 9:12, higher = TRUE) # -> returns a 4D array
#> , , Overall, PCGDP
#>
#> N/T Mean SD Min
#> East Asia & Pacific 1467 10513.2441 14383.5507 132.0776
#> Europe & Central Asia 2243 25992.9618 26435.1316 366.9354
#> Latin America & Caribbean 1976 7628.4477 8818.5055 1005.4085
#> Middle East & North Africa 842 13878.4213 18419.7912 578.5996
#> North America 180 48699.76 24196.2855 16405.9053
#> South Asia 382 1235.9256 1611.2232 265.9625
#> Sub-Saharan Africa 2380 1840.0259 2596.0104 164.3366
#> Max Skew Kurt
#> East Asia & Pacific 71992.1517 1.6392 4.7419
#> Europe & Central Asia 196061.417 2.2022 10.1977
#> Latin America & Caribbean 88391.3331 4.1702 29.3739
#> Middle East & North Africa 116232.753 2.4178 9.7669
#> North America 113236.091 0.938 2.9688
#> South Asia 8476.564 2.7874 10.3402
#> Sub-Saharan Africa 20532.9523 3.1161 14.4175
#>
#> , , Between, PCGDP
#>
#> N/T Mean SD Min Max
#> East Asia & Pacific 34 10513.2441 12771.742 444.2899 39722.0077
#> Europe & Central Asia 56 25992.9618 24051.035 809.4753 141200.38
#> Latin America & Caribbean 38 7628.4477 8470.9708 1357.3326 77403.7443
#> Skew Kurt
#> East Asia & Pacific 1.1488 2.7089
#> Europe & Central Asia 2.0026 9.0733
#> Latin America & Caribbean 4.4548 32.4956
#>
#> [ reached getOption("max.print") -- omitted 4 row(s) and 10 matrix slice(s) ]
qsu(wlddev, ~ region, ~ country, cols = 9:12,
higher = TRUE, array = FALSE) |> # Return as a list of matrices..
unlist2d(c("Variable","Trans"), row.names = "Region") |> head()# and turn into a tidy data.frame
#> Variable Trans Region N Mean SD
#> 1 PCGDP Overall East Asia & Pacific 1467 10513.244 14383.551
#> 2 PCGDP Overall Europe & Central Asia 2243 25992.962 26435.132
#> 3 PCGDP Overall Latin America & Caribbean 1976 7628.448 8818.505
#> 4 PCGDP Overall Middle East & North Africa 842 13878.421 18419.791
#> 5 PCGDP Overall North America 180 48699.760 24196.285
#> 6 PCGDP Overall South Asia 382 1235.926 1611.223
#> Min Max Skew Kurt
#> 1 132.0776 71992.152 1.6392248 4.741856
#> 2 366.9354 196061.417 2.2022472 10.197685
#> 3 1005.4085 88391.333 4.1701769 29.373869
#> 4 578.5996 116232.753 2.4177586 9.766883
#> 5 16405.9053 113236.091 0.9380056 2.968769
#> 6 265.9625 8476.564 2.7873830 10.340176
pwcor(num_vars(wlddev), P = TRUE) # Pairwise correlations with p-value
#> year decade PCGDP LIFEEX GINI ODA POP
#> year 1 .99* .16* .47* -.20* .14* .06*
#> decade .99* 1 .15* .46* -.20* .14* .06*
#> PCGDP .16* .15* 1 .57* -.44* -.16* -.06*
#> LIFEEX .47* .46* .57* 1 -.35* -.02 .03*
#> GINI -.20* -.20* -.44* -.35* 1 -.20* .04
#> ODA .14* .14* -.16* -.02 -.20* 1 .31*
#> POP .06* .06* -.06* .03* .04 .31* 1
pwcor(fmean(num_vars(wlddev), wlddev$country), P = TRUE) # Correlating country means
#> Warning: the standard deviation is zero
#> year decade PCGDP LIFEEX GINI ODA POP
#> year NA NA NA NA NA NA NA
#> decade NA 1 .00 .00 .00 .00 .00
#> PCGDP NA .00 1 .60* -.42* -.25* -.07
#> LIFEEX NA .00 .60* 1 -.40* -.21* -.02
#> GINI NA .00 -.42* -.40* 1 -.19* -.04
#> ODA NA .00 -.25* -.21* -.19* 1 .50*
#> POP NA .00 -.07 -.02 -.04 .50* 1
pwcor(fwithin(num_vars(wlddev), wlddev$country), P = TRUE) # Within-country correlations
#> year decade PCGDP LIFEEX GINI ODA POP
#> year 1 .99* .44* .84* -.21* .19* .24*
#> decade .99* 1 .44* .83* -.19* .18* .24*
#> PCGDP .44* .44* 1 .31* -.01 -.01 .06*
#> LIFEEX .84* .83* .31* 1 -.16* .17* .29*
#> GINI -.21* -.19* -.01 -.16* 1 -.08* .01
#> ODA .19* .18* -.01 .17* -.08* 1 -.11*
#> POP .24* .24* .06* .29* .01 -.11* 1
psacf(wlddev, ~country, ~year, cols = 9:12) # Panel-data Autocorrelation function
pspacf(wlddev, ~country, ~year, cols = 9:12) # Partial panel-autocorrelations
psmat(wlddev, ~iso3c, ~year, cols = 9:12) |> plot() # Convert panel to 3D array and plot
## collapse offers a few very efficent functions for data manipulation:
# Fast selecting and replacing columns
series <- get_vars(wlddev, 9:12) # Same as wlddev[9:12] but 2x faster
series <- fselect(wlddev, PCGDP:ODA) # Same thing: > 100x faster than dplyr::select
get_vars(wlddev, 9:12) <- series # Replace, 8x faster wlddev[9:12] <- series + replaces names
fselect(wlddev, PCGDP:ODA) <- series # Same thing
# Fast subsetting
head(fsubset(wlddev, country == "Ireland", -country, -iso3c))
#> date year decade region income OECD PCGDP LIFEEX
#> 1 1961-01-01 1960 1960 Europe & Central Asia High income TRUE NA 69.79651
#> 2 1962-01-01 1961 1960 Europe & Central Asia High income TRUE NA 69.97827
#> 3 1963-01-01 1962 1960 Europe & Central Asia High income TRUE NA 70.13407
#> 4 1964-01-01 1963 1960 Europe & Central Asia High income TRUE NA 70.27293
#> 5 1965-01-01 1964 1960 Europe & Central Asia High income TRUE NA 70.40129
#> 6 1966-01-01 1965 1960 Europe & Central Asia High income TRUE NA 70.52315
#> GINI ODA POP
#> 1 NA NA 2828600
#> 2 NA NA 2824400
#> 3 NA NA 2836050
#> 4 NA NA 2852650
#> 5 NA NA 2866550
#> 6 NA NA 2877300
head(fsubset(wlddev, country == "Ireland" & year > 1990, year, PCGDP:ODA))
#> year PCGDP LIFEEX GINI ODA
#> 1 1991 24642.11 75.00527 NA NA
#> 2 1992 25292.81 75.18095 NA NA
#> 3 1993 25844.34 75.33612 NA NA
#> 4 1994 27224.37 75.47680 36.9 NA
#> 5 1995 29694.65 75.61756 37.0 NA
#> 6 1996 31644.89 75.83171 35.6 NA
ss(wlddev, 1:10, 1:10) # This is an order of magnitude faster than wlddev[1:10, 1:10]
#> country iso3c date year decade region income OECD PCGDP
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA
#> 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA
#> 5 Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA
#> 6 Afghanistan AFG 1966-01-01 1965 1960 South Asia Low income FALSE NA
#> 7 Afghanistan AFG 1967-01-01 1966 1960 South Asia Low income FALSE NA
#> LIFEEX
#> 1 32.446
#> 2 32.962
#> 3 33.471
#> 4 33.971
#> 5 34.463
#> 6 34.948
#> 7 35.430
#> [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
# Fast transforming
head(ftransform(wlddev, ODA_GDP = ODA / PCGDP, ODA_LIFEEX = sqrt(ODA) / LIFEEX))
#> Warning: NaNs produced
#> country iso3c date year decade region income OECD PCGDP
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA
#> 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA
#> LIFEEX GINI ODA POP ODA_GDP ODA_LIFEEX
#> 1 32.446 NA 116769997 8996973 NA 333.0462
#> 2 32.962 NA 232080002 9169410 NA 462.1738
#> 3 33.471 NA 112839996 9351441 NA 317.3678
#> 4 33.971 NA 237720001 9543205 NA 453.8627
#> [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
settransform(wlddev, ODA_GDP = ODA / PCGDP, ODA_LIFEEX = sqrt(ODA) / LIFEEX) # by reference
#> Warning: NaNs produced
head(ftransform(wlddev, PCGDP = NULL, ODA = NULL, GINI_sum = fsum(GINI)))
#> country iso3c date year decade region income OECD LIFEEX
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE 32.446
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE 32.962
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE 33.471
#> 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE 33.971
#> 5 Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE 34.463
#> GINI POP ODA_GDP ODA_LIFEEX GINI_sum
#> 1 NA 8996973 NA 333.0462 67203.5
#> 2 NA 9169410 NA 462.1738 67203.5
#> 3 NA 9351441 NA 317.3678 67203.5
#> 4 NA 9543205 NA 453.8627 67203.5
#> 5 NA 9744781 NA 499.1535 67203.5
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
head(ftransformv(wlddev, 9:12, log)) # Can also transform with lists of columns
#> Warning: NaNs produced
#> country iso3c date year decade region income OECD PCGDP
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA
#> 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA
#> LIFEEX GINI ODA POP ODA_GDP ODA_LIFEEX
#> 1 3.479577 NA 18.57572 8996973 NA 333.0462
#> 2 3.495355 NA 19.26259 9169410 NA 462.1738
#> 3 3.510679 NA 18.54148 9351441 NA 317.3678
#> 4 3.525507 NA 19.28660 9543205 NA 453.8627
#> [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
head(ftransformv(wlddev, 9:12, fscale, apply = FALSE)) # apply = FALSE invokes fscale.data.frame
#> country iso3c date year decade region income OECD PCGDP
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA
#> 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA
#> LIFEEX GINI ODA POP ODA_GDP ODA_LIFEEX
#> 1 -2.775283 NA -0.3890241 8996973 NA 333.0462
#> 2 -2.730321 NA -0.2562874 9169410 NA 462.1738
#> 3 -2.685969 NA -0.3935480 9351441 NA 317.3678
#> 4 -2.642402 NA -0.2497951 9543205 NA 453.8627
#> [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
settransformv(wlddev, 9:12, fscale, apply = FALSE) # Changing the data by reference
ftransform(wlddev) <- fscale(gv(wlddev, 9:12)) # Same thing (using replacement method)
library(magrittr) # Same thing, using magrittr
wlddev %<>% ftransformv(9:12, fscale, apply = FALSE)
wlddev %>% ftransform(gv(., 9:12) |> # With compound pipes: Scaling and lagging
fscale() |> flag(0:2, iso3c, year)) |> head()
#> country iso3c date year decade region income OECD PCGDP
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA
#> LIFEEX GINI ODA POP ODA_GDP ODA_LIFEEX L1.PCGDP L2.PCGDP
#> 1 -2.775283 NA -0.3890241 8996973 NA 333.0462 NA NA
#> 2 -2.730321 NA -0.2562874 9169410 NA 462.1738 NA NA
#> 3 -2.685969 NA -0.3935480 9351441 NA 317.3678 NA NA
#> L1.LIFEEX L2.LIFEEX L1.GINI L2.GINI L1.ODA L2.ODA
#> 1 NA NA NA NA NA NA
#> 2 -2.775283 NA NA NA -0.3890241 NA
#> 3 -2.730321 -2.775283 NA NA -0.2562874 -0.3890241
#> [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
# Fast reordering
head(roworder(wlddev, -country, year))
#> country iso3c date year decade region income
#> 1 Zimbabwe ZWE 1961-01-01 1960 1960 Sub-Saharan Africa Lower middle income
#> 2 Zimbabwe ZWE 1962-01-01 1961 1960 Sub-Saharan Africa Lower middle income
#> 3 Zimbabwe ZWE 1963-01-01 1962 1960 Sub-Saharan Africa Lower middle income
#> 4 Zimbabwe ZWE 1964-01-01 1963 1960 Sub-Saharan Africa Lower middle income
#> OECD PCGDP LIFEEX GINI ODA POP ODA_GDP ODA_LIFEEX
#> 1 FALSE -0.5794259 -0.9826503 NA NA 3776681 NA NA
#> 2 FALSE -0.5779547 -0.9422195 NA -0.5233262 3905034 97.77415 5.912678
#> 3 FALSE -0.5789920 -0.9018759 NA NA 4039201 NA NA
#> 4 FALSE -0.5775741 -0.8620551 NA -0.4094221 4178726 96162.61027 182.938198
#> [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
head(colorder(wlddev, country, year))
#> country year iso3c date decade region income OECD PCGDP
#> 1 Afghanistan 1960 AFG 1961-01-01 1960 South Asia Low income FALSE NA
#> 2 Afghanistan 1961 AFG 1962-01-01 1960 South Asia Low income FALSE NA
#> 3 Afghanistan 1962 AFG 1963-01-01 1960 South Asia Low income FALSE NA
#> 4 Afghanistan 1963 AFG 1964-01-01 1960 South Asia Low income FALSE NA
#> LIFEEX GINI ODA POP ODA_GDP ODA_LIFEEX
#> 1 -2.775283 NA -0.3890241 8996973 NA 333.0462
#> 2 -2.730321 NA -0.2562874 9169410 NA 462.1738
#> 3 -2.685969 NA -0.3935480 9351441 NA 317.3678
#> 4 -2.642402 NA -0.2497951 9543205 NA 453.8627
#> [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
# Fast renaming
head(frename(wlddev, country = Ctry, year = Yr))
#> Ctry iso3c date Yr decade region income OECD PCGDP
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA
#> 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA
#> LIFEEX GINI ODA POP ODA_GDP ODA_LIFEEX
#> 1 -2.775283 NA -0.3890241 8996973 NA 333.0462
#> 2 -2.730321 NA -0.2562874 9169410 NA 462.1738
#> 3 -2.685969 NA -0.3935480 9351441 NA 317.3678
#> 4 -2.642402 NA -0.2497951 9543205 NA 453.8627
#> [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
setrename(wlddev, country = Ctry, year = Yr) # By reference
head(frename(wlddev, tolower, cols = 9:12))
#> Ctry iso3c date Yr decade region income OECD pcgdp
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA
#> 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA
#> lifeex gini oda POP ODA_GDP ODA_LIFEEX
#> 1 -2.775283 NA -0.3890241 8996973 NA 333.0462
#> 2 -2.730321 NA -0.2562874 9169410 NA 462.1738
#> 3 -2.685969 NA -0.3935480 9351441 NA 317.3678
#> 4 -2.642402 NA -0.2497951 9543205 NA 453.8627
#> [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
# Fast grouping
fgroup_by(wlddev, Ctry, decade) |> fgroup_vars() |> head()
#> Ctry decade
#> 1 Afghanistan 1960
#> 2 Afghanistan 1960
#> 3 Afghanistan 1960
#> 4 Afghanistan 1960
#> 5 Afghanistan 1960
#> 6 Afghanistan 1960
rm(wlddev) # .. but only works with collapse functions
## Now lets start putting things together
wlddev |> fsubset(year > 1990, region, income, PCGDP:ODA) |>
fgroup_by(region, income) |> fmean() # Fast aggregation using the mean
#> region income PCGDP LIFEEX GINI
#> 1 East Asia & Pacific High income 32671.0522 78.21996 32.95000
#> 2 East Asia & Pacific Lower middle income 1738.5111 65.45647 36.51972
#> 3 East Asia & Pacific Upper middle income 4575.8695 70.87431 40.64815
#> 4 Europe & Central Asia High income 40814.1215 77.67583 30.94218
#> 5 Europe & Central Asia Low income 695.7951 65.04128 32.13333
#> 6 Europe & Central Asia Lower middle income 1779.7149 68.79873 30.66176
#> 7 Europe & Central Asia Upper middle income 5182.1800 71.30199 34.85362
#> 8 Latin America & Caribbean High income 19864.5057 75.55674 48.35714
#> 9 Latin America & Caribbean Low income 1189.8492 58.97359 41.10000
#> 10 Latin America & Caribbean Lower middle income 1987.5292 69.31612 50.58125
#> 11 Latin America & Caribbean Upper middle income 6224.6674 72.58127 49.88547
#> ODA
#> 1 112194118
#> 2 509965862
#> 3 219146704
#> 4 321907195
#> 5 244539286
#> 6 395371417
#> 7 463652924
#> 8 31307379
#> 9 753747928
#> 10 559540257
#> 11 188305879
#> [ reached 'max' / getOption("max.print") -- omitted 12 rows ]
# Same thing using dplyr manipulation verbs
library(dplyr)
wlddev |> filter(year > 1990) |> select(region, income, PCGDP:ODA) |>
group_by(region,income) |> fmean() # This is already a lot faster than summarize_all(mean)
#> # A tibble: 23 × 6
#> region income PCGDP LIFEEX GINI ODA
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 East Asia & Pacific High income 32671. 78.2 33.0 112194118.
#> 2 East Asia & Pacific Lower middle income 1739. 65.5 36.5 509965862.
#> 3 East Asia & Pacific Upper middle income 4576. 70.9 40.6 219146704.
#> 4 Europe & Central Asia High income 40814. 77.7 30.9 321907195.
#> 5 Europe & Central Asia Low income 696. 65.0 32.1 244539286.
#> 6 Europe & Central Asia Lower middle income 1780. 68.8 30.7 395371417.
#> 7 Europe & Central Asia Upper middle income 5182. 71.3 34.9 463652924.
#> 8 Latin America & Caribbean High income 19865. 75.6 48.4 31307379.
#> 9 Latin America & Caribbean Low income 1190. 59.0 41.1 753747928.
#> 10 Latin America & Caribbean Lower middle income 1988. 69.3 50.6 559540257.
#> # ℹ 13 more rows
wlddev |> fsubset(year > 1990, region, income, PCGDP:POP) |>
fgroup_by(region, income) |> fmean(POP) # Weighted group means
#> region income sum.POP PCGDP
#> 1 East Asia & Pacific High income 6165902760 37889.3406
#> 2 East Asia & Pacific Lower middle income 13784998066 2135.6182
#> 3 East Asia & Pacific Upper middle income 40150644873 3769.4215
#> 4 Europe & Central Asia High income 13923291507 34583.4203
#> 5 Europe & Central Asia Low income 203224216 722.7351
#> 6 Europe & Central Asia Lower middle income 2399154808 2145.1352
#> 7 Europe & Central Asia Upper middle income 8919358306 8238.8391
#> 8 Latin America & Caribbean High income 842939686 13068.1667
#> 9 Latin America & Caribbean Low income 267125746 1193.5889
#> 10 Latin America & Caribbean Lower middle income 815192217 2011.6260
#> LIFEEX GINI ODA
#> 1 80.79250 32.81601 -79785907
#> 2 68.24548 36.40362 1060544119
#> 3 73.07773 40.38496 1229983586
#> 4 78.82923 32.27710 1107199019
#> 5 65.76604 32.22326 261043896
#> 6 68.98050 28.97857 556232624
#> 7 69.78322 38.66475 1187976647
#> 8 76.76809 48.85497 97880105
#> 9 59.34831 41.10000 794781510
#> 10 69.33538 50.52363 571015463
#> [ reached 'max' / getOption("max.print") -- omitted 13 rows ]
wlddev |> fsubset(year > 1990, region, income, PCGDP:POP) |>
fgroup_by(region, income) |> fsd(POP) # Weighted group standard deviations
#> region income sum.POP PCGDP
#> 1 East Asia & Pacific High income 6165902760 11619.90339
#> 2 East Asia & Pacific Lower middle income 13784998066 1074.84975
#> 3 East Asia & Pacific Upper middle income 40150644873 2374.39410
#> 4 Europe & Central Asia High income 13923291507 13593.46879
#> 5 Europe & Central Asia Low income 203224216 238.47730
#> 6 Europe & Central Asia Lower middle income 2399154808 841.97662
#> 7 Europe & Central Asia Upper middle income 8919358306 3175.38606
#> 8 Latin America & Caribbean High income 842939686 6273.64310
#> 9 Latin America & Caribbean Low income 267125746 70.56928
#> 10 Latin America & Caribbean Lower middle income 815192217 580.31064
#> LIFEEX GINI ODA
#> 1 2.730868 1.247116 129087430
#> 2 4.230245 4.316708 943798728
#> 3 2.428883 2.458222 1519636778
#> 4 2.895801 2.994940 1131585991
#> 5 4.555972 1.547793 112926454
#> 6 1.688117 4.573107 370416116
#> 7 3.707457 4.227846 1068502287
#> 8 2.421882 5.289102 71662290
#> 9 2.815436 0.000000 566758933
#> 10 4.220554 5.684628 243414533
#> [ reached 'max' / getOption("max.print") -- omitted 13 rows ]
wlddev |> na_omit(cols = "POP") |> fgroup_by(region, income) |>
fselect(PCGDP:POP) |> fnth(0.75, POP) # Weighted group third quartile
#> region income sum.POP PCGDP LIFEEX
#> 1 East Asia & Pacific High income 11407808149 42237.359 81.72529
#> 2 East Asia & Pacific Lower middle income 22174820629 2359.522 69.86711
#> 3 East Asia & Pacific Upper middle income 69639871478 3815.390 73.83074
#> 4 Europe & Central Asia High income 27285316560 37968.308 79.13238
#> 5 Europe & Central Asia Low income 311485944 1028.229 68.92227
#> 6 Europe & Central Asia Lower middle income 4511786205 3109.060 70.16087
#> 7 Europe & Central Asia Upper middle income 16972478305 10579.353 70.58594
#> 8 Latin America & Caribbean High income 1466292826 13936.359 77.63032
#> 9 Latin America & Caribbean Low income 429756890 1429.081 60.35597
#> 10 Latin America & Caribbean Lower middle income 1290800630 2248.974 71.20178
#> GINI ODA
#> 1 33.50000 755719050
#> 2 39.33432 1840884811
#> 3 42.40186 2582104924
#> 4 34.70000 1751293143
#> 5 33.59662 370935975
#> 6 29.83209 724985435
#> 7 41.20000 1799582651
#> 8 54.89047 158773677
#> 9 41.10000 968787392
#> 10 55.50000 660837650
#> [ reached 'max' / getOption("max.print") -- omitted 13 rows ]
wlddev |> fgroup_by(country) |> fselect(PCGDP:ODA) |>
fwithin() |> head() # Within transformation
#> PCGDP LIFEEX GINI ODA
#> 1 NA -16.75117 NA -1370778502
#> 2 NA -16.23517 NA -1255468497
#> 3 NA -15.72617 NA -1374708502
#> 4 NA -15.22617 NA -1249828497
#> 5 NA -14.73417 NA -1191628485
#> 6 NA -14.24917 NA -1145708502
wlddev |> fgroup_by(country) |> fselect(PCGDP:ODA) |>
fmedian(TRA = "-") |> head() # Grouped centering using the median
#> PCGDP LIFEEX GINI ODA
#> 1 NA -17.5395 NA -144765007
#> 2 NA -17.0235 NA -29455002
#> 3 NA -16.5145 NA -148695007
#> 4 NA -16.0145 NA -23815002
#> 5 NA -15.5225 NA 34385010
#> 6 NA -15.0375 NA 80304993
# Replacing data points by the weighted first quartile:
wlddev |> na_omit(cols = "POP") |> fgroup_by(country) |>
fselect(country, year, PCGDP:POP) %>%
ftransform(fselect(., -country, -year) |>
fnth(0.25, POP, "fill")) |> head()
#> country year PCGDP LIFEEX GINI ODA POP
#> 1 Afghanistan 1960 407.2632 45.91724 NA 237905880 8996973
#> 2 Afghanistan 1961 407.2632 45.91724 NA 237905880 9169410
#> 3 Afghanistan 1962 407.2632 45.91724 NA 237905880 9351441
#> 4 Afghanistan 1963 407.2632 45.91724 NA 237905880 9543205
#> 5 Afghanistan 1964 407.2632 45.91724 NA 237905880 9744781
#> 6 Afghanistan 1965 407.2632 45.91724 NA 237905880 9956320
wlddev |> fgroup_by(country) |> fselect(PCGDP:ODA) |> fscale() |> head() # Standardizing
#> PCGDP LIFEEX GINI ODA
#> 1 NA -1.653181 NA -0.6498451
#> 2 NA -1.602256 NA -0.5951801
#> 3 NA -1.552023 NA -0.6517082
#> 4 NA -1.502678 NA -0.5925063
#> 5 NA -1.454122 NA -0.5649154
#> 6 NA -1.406257 NA -0.5431461
wlddev |> fgroup_by(country) |> fselect(PCGDP:POP) |>
fscale(POP) |> head() # Weighted..
#> POP PCGDP LIFEEX GINI ODA
#> 1 8996973 NA -2.172769 NA -0.9502811
#> 2 9169410 NA -2.119489 NA -0.9011481
#> 3 9351441 NA -2.066932 NA -0.9519557
#> 4 9543205 NA -2.015304 NA -0.8987449
#> 5 9744781 NA -1.964502 NA -0.8739462
#> 6 9956320 NA -1.914423 NA -0.8543799
wlddev |> fselect(country, year, PCGDP:ODA) |> # Adding 1 lead and 2 lags of each variable
fgroup_by(country) |> flag(-1:2, year) |> head()
#> country year F1.PCGDP PCGDP L1.PCGDP L2.PCGDP F1.LIFEEX LIFEEX L1.LIFEEX
#> 1 Afghanistan 1960 NA NA NA NA 32.962 32.446 NA
#> 2 Afghanistan 1961 NA NA NA NA 33.471 32.962 32.446
#> 3 Afghanistan 1962 NA NA NA NA 33.971 33.471 32.962
#> L2.LIFEEX F1.GINI GINI L1.GINI L2.GINI F1.ODA ODA L1.ODA
#> 1 NA NA NA NA NA 232080002 116769997 NA
#> 2 NA NA NA NA NA 112839996 232080002 116769997
#> 3 32.446 NA NA NA NA 237720001 112839996 232080002
#> L2.ODA
#> 1 NA
#> 2 NA
#> 3 116769997
#> [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
wlddev |> fselect(country, year, PCGDP:ODA) |> # Adding 1 lead and 10-year growth rates
fgroup_by(country) |> fgrowth(c(0:1,10), 1, year) |> head()
#> country year PCGDP G1.PCGDP L10G1.PCGDP LIFEEX G1.LIFEEX L10G1.LIFEEX
#> 1 Afghanistan 1960 NA NA NA 32.446 NA NA
#> 2 Afghanistan 1961 NA NA NA 32.962 1.590335 NA
#> 3 Afghanistan 1962 NA NA NA 33.471 1.544202 NA
#> 4 Afghanistan 1963 NA NA NA 33.971 1.493830 NA
#> 5 Afghanistan 1964 NA NA NA 34.463 1.448294 NA
#> GINI G1.GINI L10G1.GINI ODA G1.ODA L10G1.ODA
#> 1 NA NA NA 116769997 NA NA
#> 2 NA NA NA 232080002 98.74969 NA
#> 3 NA NA NA 112839996 -51.37884 NA
#> 4 NA NA NA 237720001 110.66998 NA
#> 5 NA NA NA 295920013 24.48259 NA
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
# etc...
# Aggregation with multiple functions
wlddev |> fsubset(year > 1990, region, income, PCGDP:ODA) |>
fgroup_by(region, income) %>% {
add_vars(fgroup_vars(., "unique"),
fmedian(., keep.group_vars = FALSE) |> add_stub("median_"),
fmean(., keep.group_vars = FALSE) |> add_stub("mean_"),
fsd(., keep.group_vars = FALSE) |> add_stub("sd_"))
} |> head()
#> region income median_PCGDP median_LIFEEX
#> 1 East Asia & Pacific High income 32573.8177 78.54024
#> 2 East Asia & Pacific Lower middle income 1658.3786 66.07200
#> 3 East Asia & Pacific Upper middle income 3583.2189 70.61500
#> 4 Europe & Central Asia High income 36201.7707 78.16061
#> 5 Europe & Central Asia Low income 668.9513 66.08000
#> median_GINI median_ODA mean_PCGDP mean_LIFEEX mean_GINI mean_ODA sd_PCGDP
#> 1 32.75 11500000 32671.0522 78.21996 32.95000 112194118 13031.1867
#> 2 35.70 257079987 1738.5111 65.45647 36.51972 509965862 904.3004
#> 3 40.35 49730000 4575.8695 70.87431 40.64815 219146704 2489.3795
#> 4 31.10 138889999 40814.1215 77.67583 30.94218 321907195 29485.3091
#> 5 32.45 239055000 695.7951 65.04128 32.13333 244539286 242.4158
#> sd_LIFEEX sd_GINI sd_ODA
#> 1 3.825737 1.322624 223580786
#> 2 5.003373 4.779528 686619373
#> 3 3.157915 3.506637 684346804
#> 4 3.810700 3.676878 632730086
#> 5 4.723263 1.713087 116363515
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
# Transformation with multiple functions
wlddev |> fselect(country, year, PCGDP:ODA) |>
fgroup_by(country) %>% {
add_vars(fdiff(., c(1,10), 1, year) |> flag(0:2, year), # Sequence of lagged differences
ftransform(., fselect(., PCGDP:ODA) |> fwithin() |> add_stub("W.")) |>
flag(0:2, year, keep.ids = FALSE)) # Sequence of lagged demeaned vars
} |> head()
#> country year D1.PCGDP L1.D1.PCGDP L2.D1.PCGDP L10D1.PCGDP L1.L10D1.PCGDP
#> 1 Afghanistan 1960 NA NA NA NA NA
#> L2.L10D1.PCGDP D1.LIFEEX L1.D1.LIFEEX L2.D1.LIFEEX L10D1.LIFEEX
#> 1 NA NA NA NA NA
#> L1.L10D1.LIFEEX L2.L10D1.LIFEEX D1.GINI L1.D1.GINI L2.D1.GINI L10D1.GINI
#> 1 NA NA NA NA NA NA
#> L1.L10D1.GINI L2.L10D1.GINI D1.ODA L1.D1.ODA L2.D1.ODA L10D1.ODA L1.L10D1.ODA
#> 1 NA NA NA NA NA NA NA
#> L2.L10D1.ODA PCGDP L1.PCGDP L2.PCGDP LIFEEX L1.LIFEEX L2.LIFEEX GINI L1.GINI
#> 1 NA NA NA NA 32.446 NA NA NA NA
#> L2.GINI ODA L1.ODA L2.ODA W.PCGDP L1.W.PCGDP L2.W.PCGDP W.LIFEEX
#> 1 NA 116769997 NA NA NA NA NA -16.75117
#> L1.W.LIFEEX L2.W.LIFEEX W.GINI L1.W.GINI L2.W.GINI W.ODA L1.W.ODA
#> 1 NA NA NA NA NA -1370778502 NA
#> L2.W.ODA
#> 1 NA
#> [ reached 'max' / getOption("max.print") -- omitted 5 rows ]
# With ftransform, can also easily do one or more grouped mutations on the fly..
settransform(wlddev, median_ODA = fmedian(ODA, list(region, income), TRA = "fill"))
settransform(wlddev, sd_ODA = fsd(ODA, list(region, income), TRA = "fill"),
mean_GDP = fmean(PCGDP, country, TRA = "fill"))
wlddev %<>% ftransform(fmedian(list(median_ODA = ODA, median_GDP = PCGDP),
list(region, income), TRA = "fill"))
# On a groped data frame it is also possible to grouped transform certain columns
# but perform aggregate operatins on others:
wlddev |> fgroup_by(region, income) %>%
ftransform(gmedian_GDP = fmedian(PCGDP, GRP(.), TRA = "replace"),
omedian_GDP = fmedian(PCGDP, TRA = "replace"), # "replace" preserves NA's
omedian_GDP_fill = fmedian(PCGDP)) |> tail()
#> country iso3c date year decade region
#> 13171 Zimbabwe ZWE 2016-01-01 2015 2010 Sub-Saharan Africa
#> 13172 Zimbabwe ZWE 2017-01-01 2016 2010 Sub-Saharan Africa
#> 13173 Zimbabwe ZWE 2018-01-01 2017 2010 Sub-Saharan Africa
#> income OECD PCGDP LIFEEX GINI ODA POP
#> 13171 Lower middle income FALSE 1234.103 59.534 NA 817729980 13814629
#> 13172 Lower middle income FALSE 1224.310 60.294 NA 687659973 14030390
#> 13173 Lower middle income FALSE 1263.321 60.812 44.3 753909973 14236745
#> median_ODA sd_ODA mean_GDP median_GDP gmedian_GDP omedian_GDP
#> 13171 280630005 694376321 1219.436 1336.053 1336.053 3767.162
#> 13172 280630005 694376321 1219.436 1336.053 1336.053 3767.162
#> 13173 280630005 694376321 1219.436 1336.053 1336.053 3767.162
#> omedian_GDP_fill
#> 13171 3767.162
#> 13172 3767.162
#> 13173 3767.162
#> [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
rm(wlddev)
## For multi-type data aggregation, the function collap() offers ease and flexibility
# Aggregate this data by country and decade: Numeric columns with mean, categorical with mode
head(collap(wlddev, ~ country + decade, fmean, fmode))
#> country iso3c date year decade region income OECD
#> 1 Afghanistan AFG 1961-01-01 1964.5 1960 South Asia Low income FALSE
#> 2 Afghanistan AFG 1971-01-01 1974.5 1970 South Asia Low income FALSE
#> 3 Afghanistan AFG 1981-01-01 1984.5 1980 South Asia Low income FALSE
#> 4 Afghanistan AFG 1991-01-01 1994.5 1990 South Asia Low income FALSE
#> 5 Afghanistan AFG 2001-01-01 2004.5 2000 South Asia Low income FALSE
#> PCGDP LIFEEX GINI ODA POP
#> 1 NA 34.6908 NA 222288999 9886773
#> 2 NA 39.9053 NA 236169998 12451803
#> 3 NA 46.4176 NA 71666001 12291854
#> 4 NA 53.0097 NA 317255000 16931903
#> 5 379.373 58.0881 NA 3054051961 24870022
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
# taking weighted mean and weighted mode:
head(collap(wlddev, ~ country + decade, fmean, fmode, w = ~ POP, wFUN = fsum))
#> country iso3c date year decade region income OECD
#> 1 Afghanistan AFG 1970-01-01 1964.675 1960 South Asia Low income FALSE
#> 2 Afghanistan AFG 1980-01-01 1974.672 1970 South Asia Low income FALSE
#> 3 Afghanistan AFG 1981-01-01 1984.364 1980 South Asia Low income FALSE
#> 4 Afghanistan AFG 2000-01-01 1994.941 1990 South Asia Low income FALSE
#> 5 Afghanistan AFG 2010-01-01 2004.788 2000 South Asia Low income FALSE
#> PCGDP LIFEEX GINI ODA POP
#> 1 NA 34.77716 NA 223006447 98867731
#> 2 NA 40.00367 NA 236798314 124518028
#> 3 NA 46.32098 NA 70613923 122918537
#> 4 NA 53.25897 NA 306818649 169319030
#> 5 382.5583 58.23630 NA 3240143310 248700217
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
# Multi-function aggregation of certain columns
head(collap(wlddev, ~ country + decade,
list(fmean, fmedian, fsd),
list(ffirst, flast), cols = c(3,9:12)))
#> country ffirst.date flast.date decade fmean.PCGDP fmedian.PCGDP fsd.PCGDP
#> 1 Afghanistan 1961-01-01 1970-01-01 1960 NA NA NA
#> 2 Afghanistan 1971-01-01 1980-01-01 1970 NA NA NA
#> 3 Afghanistan 1981-01-01 1990-01-01 1980 NA NA NA
#> 4 Afghanistan 1991-01-01 2000-01-01 1990 NA NA NA
#> fmean.LIFEEX fmedian.LIFEEX fsd.LIFEEX fmean.GINI fmedian.GINI fsd.GINI
#> 1 34.6908 34.7055 1.490964 NA NA NA
#> 2 39.9053 39.8430 1.738383 NA NA NA
#> 3 46.4176 46.4005 2.161460 NA NA NA
#> 4 53.0097 53.1200 1.695424 NA NA NA
#> fmean.ODA fmedian.ODA fsd.ODA
#> 1 222288999 234900002 80884369
#> 2 236169998 246509995 34241008
#> 3 71666001 48539999 72958531
#> 4 317255000 285175003 160500141
#> [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
# Customized Aggregation: Assign columns to functions
head(collap(wlddev, ~ country + decade,
custom = list(fmean = 9:10, fsd = 9:12, flast = 3, ffirst = 6:8)))
#> country flast.date decade ffirst.region ffirst.income ffirst.OECD
#> 1 Afghanistan 1970-01-01 1960 South Asia Low income FALSE
#> 2 Afghanistan 1980-01-01 1970 South Asia Low income FALSE
#> 3 Afghanistan 1990-01-01 1980 South Asia Low income FALSE
#> 4 Afghanistan 2000-01-01 1990 South Asia Low income FALSE
#> 5 Afghanistan 2010-01-01 2000 South Asia Low income FALSE
#> fmean.PCGDP fsd.PCGDP fmean.LIFEEX fsd.LIFEEX fsd.GINI fsd.ODA
#> 1 NA NA 34.6908 1.490964 NA 80884369
#> 2 NA NA 39.9053 1.738383 NA 34241008
#> 3 NA NA 46.4176 2.161460 NA 72958531
#> 4 NA NA 53.0097 1.695424 NA 160500141
#> 5 379.373 53.66524 58.0881 1.565630 NA 2013110021
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
# For grouped data frames use collapg
wlddev |> fsubset(year > 1990, country, region, income, PCGDP:ODA) |>
fgroup_by(country) |> collapg(fmean, ffirst) |>
ftransform(AMGDP = PCGDP > fmedian(PCGDP, list(region, income), TRA = "fill"),
AMODA = ODA > fmedian(ODA, income, TRA = "replace_fill")) |> head()
#> country region income PCGDP
#> 1 Afghanistan South Asia Low income 483.8351
#> 2 Albania Europe & Central Asia Upper middle income 3127.1510
#> 3 Algeria Middle East & North Africa Upper middle income 4056.4341
#> 4 American Samoa East Asia & Pacific Upper middle income 10071.0659
#> 5 Andorra Europe & Central Asia High income 40768.8453
#> 6 Angola Sub-Saharan Africa Lower middle income 2876.5065
#> LIFEEX GINI ODA AMGDP AMODA
#> 1 58.32283 NA 2888193791 FALSE TRUE
#> 2 75.19266 31.41111 343797587 FALSE TRUE
#> 3 72.57717 31.45000 287459654 FALSE TRUE
#> 4 NA NA NA TRUE NA
#> 5 NA NA NA TRUE NA
#> 6 51.59572 48.66667 412104483 TRUE FALSE
## Additional flexibility for data transformation tasks is offerend by tidy transformation operators
# Within-transformation (centering on overall mean)
head(W(wlddev, ~ country, cols = 9:12, mean = "overall.mean"))
#> country W.PCGDP W.LIFEEX W.GINI W.ODA
#> 1 Afghanistan NA 47.54514 NA -916058371
#> 2 Afghanistan NA 48.06114 NA -800748366
#> 3 Afghanistan NA 48.57014 NA -919988371
#> 4 Afghanistan NA 49.07014 NA -795108366
#> 5 Afghanistan NA 49.56214 NA -736908354
#> 6 Afghanistan NA 50.04714 NA -690988371
# Partialling out country and year fixed effects
head(HDW(wlddev, PCGDP + LIFEEX ~ qF(country) + qF(year)))
#> HDW.PCGDP HDW.LIFEEX
#> 1 1578.6211 -1.3980224
#> 2 1412.8849 -1.1838196
#> 3 917.2033 -1.0547978
#> 4 627.8605 -0.8296048
#> 5 168.0458 -0.6683027
#> 6 -234.9535 -0.4708428
# Same, adding ODA as continuous regressor
head(HDW(wlddev, PCGDP + LIFEEX ~ qF(country) + qF(year) + ODA))
#> HDW.PCGDP HDW.LIFEEX
#> 1 -324.3991 -1.1765307
#> 2 -439.5404 -0.9751559
#> 3 -598.9266 -0.7835446
#> 4 100.2175 -0.6186010
#> 5 -70.7664 -0.4966332
#> 6 330.3561 -0.2257800
# Standardizing (scaling and centering) by country
head(STD(wlddev, ~ country, cols = 9:12))
#> country STD.PCGDP STD.LIFEEX STD.GINI STD.ODA
#> 1 Afghanistan NA -1.653181 NA -0.6498451
#> 2 Afghanistan NA -1.602256 NA -0.5951801
#> 3 Afghanistan NA -1.552023 NA -0.6517082
#> 4 Afghanistan NA -1.502678 NA -0.5925063
#> 5 Afghanistan NA -1.454122 NA -0.5649154
#> 6 Afghanistan NA -1.406257 NA -0.5431461
# Computing 1 lead and 3 lags of the 4 series
head(L(wlddev, -1:3, ~ country, ~year, cols = 9:12))
#> country year F1.PCGDP PCGDP L1.PCGDP L2.PCGDP L3.PCGDP F1.LIFEEX LIFEEX
#> 1 Afghanistan 1960 NA NA NA NA NA 32.962 32.446
#> 2 Afghanistan 1961 NA NA NA NA NA 33.471 32.962
#> 3 Afghanistan 1962 NA NA NA NA NA 33.971 33.471
#> L1.LIFEEX L2.LIFEEX L3.LIFEEX F1.GINI GINI L1.GINI L2.GINI L3.GINI F1.ODA
#> 1 NA NA NA NA NA NA NA NA 232080002
#> 2 32.446 NA NA NA NA NA NA NA 112839996
#> 3 32.962 32.446 NA NA NA NA NA NA 237720001
#> ODA L1.ODA L2.ODA L3.ODA
#> 1 116769997 NA NA NA
#> 2 232080002 116769997 NA NA
#> 3 112839996 232080002 116769997 NA
#> [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
# Computing the 1- and 10-year first differences
head(D(wlddev, c(1,10), 1, ~ country, ~year, cols = 9:12))
#> country year D1.PCGDP L10D1.PCGDP D1.LIFEEX L10D1.LIFEEX D1.GINI
#> 1 Afghanistan 1960 NA NA NA NA NA
#> 2 Afghanistan 1961 NA NA 0.516 NA NA
#> 3 Afghanistan 1962 NA NA 0.509 NA NA
#> 4 Afghanistan 1963 NA NA 0.500 NA NA
#> 5 Afghanistan 1964 NA NA 0.492 NA NA
#> 6 Afghanistan 1965 NA NA 0.485 NA NA
#> L10D1.GINI D1.ODA L10D1.ODA
#> 1 NA NA NA
#> 2 NA 115310005 NA
#> 3 NA -119240005 NA
#> 4 NA 124880005 NA
#> 5 NA 58200012 NA
#> 6 NA 45919983 NA
head(D(wlddev, c(1,10), 1:2, ~ country, ~year, cols = 9:12)) # ..first and second differences
#> country year D1.PCGDP D2.PCGDP L10D1.PCGDP L10D2.PCGDP D1.LIFEEX
#> 1 Afghanistan 1960 NA NA NA NA NA
#> 2 Afghanistan 1961 NA NA NA NA 0.516
#> 3 Afghanistan 1962 NA NA NA NA 0.509
#> D2.LIFEEX L10D1.LIFEEX L10D2.LIFEEX D1.GINI D2.GINI L10D1.GINI L10D2.GINI
#> 1 NA NA NA NA NA NA NA
#> 2 NA NA NA NA NA NA NA
#> 3 -0.007 NA NA NA NA NA NA
#> D1.ODA D2.ODA L10D1.ODA L10D2.ODA
#> 1 NA NA NA NA
#> 2 115310005 NA NA NA
#> 3 -119240005 -234550011 NA NA
#> [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
# Computing the 1- and 10-year growth rates
head(G(wlddev, c(1,10), 1, ~ country, ~year, cols = 9:12))
#> country year G1.PCGDP L10G1.PCGDP G1.LIFEEX L10G1.LIFEEX G1.GINI
#> 1 Afghanistan 1960 NA NA NA NA NA
#> 2 Afghanistan 1961 NA NA 1.590335 NA NA
#> 3 Afghanistan 1962 NA NA 1.544202 NA NA
#> 4 Afghanistan 1963 NA NA 1.493830 NA NA
#> 5 Afghanistan 1964 NA NA 1.448294 NA NA
#> 6 Afghanistan 1965 NA NA 1.407306 NA NA
#> L10G1.GINI G1.ODA L10G1.ODA
#> 1 NA NA NA
#> 2 NA 98.74969 NA
#> 3 NA -51.37884 NA
#> 4 NA 110.66998 NA
#> 5 NA 24.48259 NA
#> 6 NA 15.51770 NA
# Adding growth rate variables to dataset
add_vars(wlddev) <- G(wlddev, c(1, 10), 1, ~ country, ~year, cols = 9:12, keep.ids = FALSE)
get_vars(wlddev, "G1.", regex = TRUE) <- NULL # Deleting again
# These operators can conveniently be used in regression formulas:
# Using a Mundlak (1978) procedure to estimate the effect of OECD on LIFEEX, controlling for PCGDP
lm(LIFEEX ~ log(PCGDP) + OECD + B(log(PCGDP), country),
wlddev |> fselect(country, OECD, PCGDP, LIFEEX) |> na_omit())
#>
#> Call:
#> lm(formula = LIFEEX ~ log(PCGDP) + OECD + B(log(PCGDP), country),
#> data = na_omit(fselect(wlddev, country, OECD, PCGDP, LIFEEX)))
#>
#> Coefficients:
#> (Intercept) log(PCGDP) OECDTRUE
#> 19.32590 8.20551 0.02478
#> B(log(PCGDP), country)
#> -2.65428
#>
# Adding 10-year lagged life-expectancy to allow for some convergence effects (dynamic panel model)
lm(LIFEEX ~ L(LIFEEX, 10, country) + log(PCGDP) + OECD + B(log(PCGDP), country),
wlddev |> fselect(country, OECD, PCGDP, LIFEEX) |> na_omit())
#>
#> Call:
#> lm(formula = LIFEEX ~ L(LIFEEX, 10, country) + log(PCGDP) + OECD +
#> B(log(PCGDP), country), data = na_omit(fselect(wlddev, country,
#> OECD, PCGDP, LIFEEX)))
#>
#> Coefficients:
#> (Intercept) L(LIFEEX, 10, country) log(PCGDP)
#> 9.2756 0.8656 0.9229
#> OECDTRUE B(log(PCGDP), country)
#> 0.4158 -0.6581
#>
# Tranformation functions and operators also support indexed data classes:
wldi <- findex_by(wlddev, country, year)
head(W(wldi$PCGDP)) # Country-demeaning
#> [1] NA NA NA NA NA NA
#>
#> Indexed by: country [1] | year [6 (61)]
head(W(wldi, cols = 9:12))
#> country year W.PCGDP W.LIFEEX W.GINI W.ODA
#> 1 Afghanistan 1960 NA -16.75117 NA -1370778502
#> 2 Afghanistan 1961 NA -16.23517 NA -1255468497
#> 3 Afghanistan 1962 NA -15.72617 NA -1374708502
#> 4 Afghanistan 1963 NA -15.22617 NA -1249828497
#> 5 Afghanistan 1964 NA -14.73417 NA -1191628485
#> 6 Afghanistan 1965 NA -14.24917 NA -1145708502
#>
#> Indexed by: country [1] | year [6 (61)]
head(W(wldi$PCGDP, effect = 2)) # Time-demeaning
#> [1] NA NA NA NA NA NA
#>
#> Indexed by: country [1] | year [6 (61)]
head(W(wldi, effect = 2, cols = 9:12))
#> country year W.PCGDP W.LIFEEX W.GINI W.ODA
#> 1 Afghanistan 1960 NA -21.46606 NA -122241092
#> 2 Afghanistan 1961 NA -21.51241 NA -37552049
#> 3 Afghanistan 1962 NA -21.38618 NA -183366702
#> 4 Afghanistan 1963 NA -21.23172 NA -54896550
#> 5 Afghanistan 1964 NA -21.20502 NA -9633789
#> 6 Afghanistan 1965 NA -21.18163 NA 5438669
#>
#> Indexed by: country [1] | year [6 (61)]
head(HDW(wldi$PCGDP)) # Country- and time-demeaning
#> [1] NA NA NA NA NA NA
#>
#> Indexed by: country [1] | year [6 (61)]
head(HDW(wldi, cols = 9:12))
#> HDW.PCGDP HDW.LIFEEX HDW.GINI HDW.ODA
#> 1 NA -6.706423 NA -1093922188
#> 2 NA -6.688440 NA -1032355993
#> 3 NA -6.562210 NA -1156945288
#> 4 NA -6.472079 NA -1046169271
#> 5 NA -6.445378 NA -996348510
#> 6 NA -6.367659 NA -983277444
#>
#> Indexed by: country [1] | year [6 (61)]
head(STD(wldi$PCGDP)) # Standardizing by country
#> [1] NA NA NA NA NA NA
#>
#> Indexed by: country [1] | year [6 (61)]
head(STD(wldi, cols = 9:12))
#> country year STD.PCGDP STD.LIFEEX STD.GINI STD.ODA
#> 1 Afghanistan 1960 NA -1.653181 NA -0.6498451
#> 2 Afghanistan 1961 NA -1.602256 NA -0.5951801
#> 3 Afghanistan 1962 NA -1.552023 NA -0.6517082
#> 4 Afghanistan 1963 NA -1.502678 NA -0.5925063
#> 5 Afghanistan 1964 NA -1.454122 NA -0.5649154
#> 6 Afghanistan 1965 NA -1.406257 NA -0.5431461
#>
#> Indexed by: country [1] | year [6 (61)]
head(L(wldi$PCGDP, -1:3)) # Panel-lags
#> F1 -- L1 L2 L3
#> [1,] NA NA NA NA NA
#> [2,] NA NA NA NA NA
#> [3,] NA NA NA NA NA
#> [4,] NA NA NA NA NA
#> [5,] NA NA NA NA NA
#> [6,] NA NA NA NA NA
#> attr(,"class")
#> [1] "matrix" "array"
#>
#> Indexed by: country [1] | year [6 (61)]
head(L(wldi, -1:3, 9:12))
#> country year F1.PCGDP PCGDP L1.PCGDP L2.PCGDP L3.PCGDP F1.LIFEEX LIFEEX
#> 1 Afghanistan 1960 NA NA NA NA NA 32.962 32.446
#> 2 Afghanistan 1961 NA NA NA NA NA 33.471 32.962
#> 3 Afghanistan 1962 NA NA NA NA NA 33.971 33.471
#> L1.LIFEEX L2.LIFEEX L3.LIFEEX F1.GINI GINI L1.GINI L2.GINI L3.GINI F1.ODA
#> 1 NA NA NA NA NA NA NA NA 232080002
#> 2 32.446 NA NA NA NA NA NA NA 112839996
#> 3 32.962 32.446 NA NA NA NA NA NA 237720001
#> ODA L1.ODA L2.ODA L3.ODA
#> 1 116769997 NA NA NA
#> 2 232080002 116769997 NA NA
#> 3 112839996 232080002 116769997 NA
#> [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
#>
#> Indexed by: country [1] | year [6 (61)]
head(G(wldi$PCGDP)) # Panel-Growth rates
#> [1] NA NA NA NA NA NA
#>
#> Indexed by: country [1] | year [6 (61)]
head(G(wldi, 1, 1, 9:12))
#> country year G1.PCGDP G1.LIFEEX G1.GINI G1.ODA
#> 1 Afghanistan 1960 NA NA NA NA
#> 2 Afghanistan 1961 NA 1.590335 NA 98.74969
#> 3 Afghanistan 1962 NA 1.544202 NA -51.37884
#> 4 Afghanistan 1963 NA 1.493830 NA 110.66998
#> 5 Afghanistan 1964 NA 1.448294 NA 24.48259
#> 6 Afghanistan 1965 NA 1.407306 NA 15.51770
#>
#> Indexed by: country [1] | year [6 (61)]
lm(Dlog(PCGDP) ~ L(Dlog(LIFEEX), 0:3), wldi) # Panel data regression
#>
#> Call:
#> lm(formula = Dlog(PCGDP) ~ L(Dlog(LIFEEX), 0:3), data = wldi)
#>
#> Coefficients:
#> (Intercept) L(Dlog(LIFEEX), 0:3)-- L(Dlog(LIFEEX), 0:3)L1
#> 0.01544 -0.12618 0.38523
#> L(Dlog(LIFEEX), 0:3)L2 L(Dlog(LIFEEX), 0:3)L3
#> 0.54179 -0.16475
#>
rm(wldi)
# Remove all objects used in this example section
rm(v, d, w, f, f1, f2, g, mtcarsM, sds, series, wlddev)