vignettes/collapse_and_data.table.Rmd
This vignette focuses on using collapse with the popular data.table package by Matt Dowle and Arun Srinivasan. In contrast to dplyr and plm whose methods (‘grouped_df’, ‘pseries’, ‘pdata.frame’) collapse supports, the integration between collapse and data.table is hidden in the ‘data.frame’ methods and collapse’s C code.
From version 1.6.0 collapse seamlessly handles data.tables, permitting reference operations (set*, :=) on data tables created with collapse (qDT) or returned from collapse’s data manipulation functions (= all functions except .FAST_FUN, .OPERATOR_FUN, BY and TRA, see the NEWS for details on the low-level integration). Apart from data.table reference semantics, both packages work similarly on the C/C++ side of things, and nicely complement each other in functionality.
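As a minimal sketch of what this integration permits, the snippet below modifies a data.table created with qDT by reference. It uses the built-in mtcars data, and the column names kpl and km_per_l are made up for illustration.

```r
library(collapse)
library(data.table)

# qDT() converts a data frame to a data.table
DT_cars <- qDT(mtcars)

# Add a column by reference with := (kpl is a hypothetical column name)
DT_cars[, kpl := mpg * 0.425144]

# set* functions also operate by reference on the same object
setnames(DT_cars, "kpl", "km_per_l")

names(DT_cars)
```

The original mtcars data frame is left untouched, since only the data.table gains the new column.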
Both data.table and collapse are high-performance packages that work well together. For effective co-use it is helpful to understand where each has its strengths, what one can do that the other cannot, and where they overlap. Hence this short comparison:
data.table offers an enhanced data frame based class to contain data (including list columns). For this class it provides a concise data manipulation syntax which also includes fast aggregation / split-apply-combine computing, (rolling, non-equi) joins, keying, reshaping, some time-series functionality like lagging and rolling statistics, set operations on tables and a number of other very useful functions like the fast csv reader, fast switches, list-transpose etc. data.table makes data management and computations on data very easy and scalable, supporting huge datasets in a very memory efficient way. The package caters well to the end user by compressing an enormous amount of functionality into two square brackets []. Some of the exported functions are great for programming and also support other classes, but a lot of the functionality and optimization of data.table happens under the hood and can only be accessed through the non-standard evaluation table[i, j, by] syntax. This syntax has a cost of about 1-3 milliseconds for each call. Memory efficiency and thread-parallelization make data.table the star performer on huge data.
collapse is class-agnostic in nature, supporting vectors, matrices, data frames and non-destructively handling most R classes and objects. It focuses on advanced statistical computing, providing fast column-wise grouped and weighted statistical functions, fast and complex data aggregation and transformations, linear fitting, time series and panel data computations, advanced summary statistics, and recursive processing of lists of data objects. It also includes powerful functions for data manipulation, grouping / factor generation, recoding, handling outliers and missing values. The package default for missing values is na.rm = TRUE, which is implemented efficiently in C/C++ in all functions. collapse supports both tidyverse (piped) and base R / standard evaluation programming. It makes accessible most of its internal C/C++ based functionality (like grouping objects). collapse’s R functions are simple and strongly optimized, i.e. they access the serial C/C++ code quickly, resulting in baseline execution speeds of 10-50 microseconds. All of this makes collapse ideal for advanced statistical computing on matrices and larger datasets, and tasks requiring fast programs with repeated function executions.
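Two of the points above, the na.rm = TRUE default and the class-agnostic design, can be illustrated with a short sketch on the built-in mtcars data. The vector x is made up for illustration.

```r
library(collapse)

x <- c(1, NA, 3)
mean(x)    # NA: base R propagates missing values by default
fmean(x)   # 2: collapse skips missing values by default (na.rm = TRUE)

# The same function works on vectors, matrices and data frames
m_vec <- fmean(mtcars$mpg)   # a single number
m_mat <- fmean(qM(mtcars))   # column means via the matrix method
m_df  <- fmean(mtcars)       # column means via the data frame method

all.equal(unname(m_mat), unname(m_df))
```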
Applying collapse functions to a data.table always gives a data.table back e.g.
library(collapse)
library(magrittr)
library(data.table)
DT <- qDT(wlddev) # collapse::qDT converts objects to data.table using a shallow copy
DT %>% gby(country) %>% gv(9:13) %>% fmean
# country PCGDP LIFEEX GINI ODA POP
# 1: Afghanistan 483.8351 49.19717 NA 1487548499 18362258.22
# 2: Albania 2819.2400 71.68027 31.41111 312928126 2708297.17
# 3: Algeria 3532.2714 63.56290 34.36667 612238500 25305290.68
# 4: American Samoa 10071.0659 NA NA NA 43115.10
# 5: Andorra 40083.0911 NA NA NA 51547.35
# ---
# 212: Virgin Islands (U.S.) 35629.7336 73.71292 NA NA 92238.53
# 213: West Bank and Gaza 2388.4348 71.60780 34.52500 1638581462 3312289.13
# 214: Yemen, Rep. 1069.6596 52.53707 35.46667 859950996 13741375.82
# 215: Zambia 1318.8627 51.09263 52.68889 734624330 8614972.38
# 216: Zimbabwe 1219.4360 54.53360 45.93333 397104997 9402160.33
# Same thing, but notice that fmean gives NA (rather than NaN) for countries with no data
DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13]
# country PCGDP LIFEEX GINI ODA POP
# 1: Afghanistan 483.8351 49.19717 NaN 1487548499 18362258.22
# 2: Albania 2819.2400 71.68027 31.41111 312928126 2708297.17
# 3: Algeria 3532.2714 63.56290 34.36667 612238500 25305290.68
# 4: American Samoa 10071.0659 NaN NaN NaN 43115.10
# 5: Andorra 40083.0911 NaN NaN NaN 51547.35
# ---
# 212: Virgin Islands (U.S.) 35629.7336 73.71292 NaN NaN 92238.53
# 213: West Bank and Gaza 2388.4348 71.60780 34.52500 1638581462 3312289.13
# 214: Yemen, Rep. 1069.6596 52.53707 35.46667 859950996 13741375.82
# 215: Zambia 1318.8627 51.09263 52.68889 734624330 8614972.38
# 216: Zimbabwe 1219.4360 54.53360 45.93333 397104997 9402160.33
# This also works without magrittr pipes with the collap() function
collap(DT, ~ country, fmean, cols = 9:13)
# country PCGDP LIFEEX GINI ODA POP
# 1: Afghanistan 483.8351 49.19717 NA 1487548499 18362258.22
# 2: Albania 2819.2400 71.68027 31.41111 312928126 2708297.17
# 3: Algeria 3532.2714 63.56290 34.36667 612238500 25305290.68
# 4: American Samoa 10071.0659 NA NA NA 43115.10
# 5: Andorra 40083.0911 NA NA NA 51547.35
# ---
# 212: Virgin Islands (U.S.) 35629.7336 73.71292 NA NA 92238.53
# 213: West Bank and Gaza 2388.4348 71.60780 34.52500 1638581462 3312289.13
# 214: Yemen, Rep. 1069.6596 52.53707 35.46667 859950996 13741375.82
# 215: Zambia 1318.8627 51.09263 52.68889 734624330 8614972.38
# 216: Zimbabwe 1219.4360 54.53360 45.93333 397104997 9402160.33
By default, collapse orders groups in aggregations, which is equivalent to using keyby with data.table. gby / fgroup_by has an argument sort = FALSE to yield an unordered grouping equivalent to data.table’s by on character data (see footnote 1).
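The difference between sorted and unsorted grouping can be seen on a tiny example. This is a sketch on made-up data (DT_small and its columns are hypothetical), using a character grouping column.

```r
library(collapse)
library(data.table)

DT_small <- data.table(g = c("b", "a", "b", "a"), x = c(1, 2, 3, 4))

# Default: groups are sorted, like keyby = g in data.table
res_sorted <- fmean(fgroup_by(DT_small, g))

# sort = FALSE: groups appear in order of first appearance, like by = g
res_unsorted <- fmean(fgroup_by(DT_small, g, sort = FALSE))

res_sorted$g     # "a" "b"
res_unsorted$g   # "b" "a"
```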
At this data size collapse outperforms data.table (which might reverse as data size grows, depending on your computer, the number of data.table threads used, and the function in question):
library(microbenchmark)
microbenchmark(collapse = DT %>% gby(country) %>% get_vars(9:13) %>% fmean,
data.table = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13])
# Unit: microseconds
# expr min lq mean median uq max neval
# collapse 203.483 209.3665 215.0721 212.0725 216.8900 317.873 100
# data.table 541.897 567.8705 610.1177 574.7995 580.5805 3719.110 100
It is critical to never do something like this:
DT[, lapply(.SD, fmean), keyby = country, .SDcols = 9:13]
# country PCGDP LIFEEX GINI ODA POP
# 1: Afghanistan 483.8351 49.19717 NA 1487548499 18362258.22
# 2: Albania 2819.2400 71.68027 31.41111 312928126 2708297.17
# 3: Algeria 3532.2714 63.56290 34.36667 612238500 25305290.68
# 4: American Samoa 10071.0659 NA NA NA 43115.10
# 5: Andorra 40083.0911 NA NA NA 51547.35
# ---
# 212: Virgin Islands (U.S.) 35629.7336 73.71292 NA NA 92238.53
# 213: West Bank and Gaza 2388.4348 71.60780 34.52500 1638581462 3312289.13
# 214: Yemen, Rep. 1069.6596 52.53707 35.46667 859950996 13741375.82
# 215: Zambia 1318.8627 51.09263 52.68889 734624330 8614972.38
# 216: Zimbabwe 1219.4360 54.53360 45.93333 397104997 9402160.33
The reason is that collapse functions are S3 generic with methods for vectors, matrices and data frames among others. So you incur a method-dispatch for every column and every group the function is applied to.
fmean
# function (x, ...)
# UseMethod("fmean")
# <bytecode: 0x118c0cf20>
# <environment: namespace:collapse>
methods(fmean)
# [1] fmean.data.frame fmean.default fmean.grouped_df* fmean.list* fmean.matrix
# see '?methods' for accessing help and source code
You may now contend that base::mean is also S3 generic, but in the call DT[, lapply(.SD, mean, na.rm = TRUE), by = country, .SDcols = 9:13] data.table does not use base::mean but data.table:::gmean, an internal optimized mean function which is efficiently applied over those groups (see ?data.table::GForce). fmean works similarly, and includes this functionality explicitly.
args(fmean.data.frame)
# function (x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE, use.g.names = TRUE,
# drop = TRUE, nthreads = 1L, ...)
# NULL
Here we can see the x argument for the data, the g argument for grouping vectors, a weight vector w, different options TRA to transform the original data using the computed means, and some functionality regarding missing values (default: removed / skipped), group names (which are added as row-names to a data frame, but not to a data.table) etc. So we can also do:
fmean(gv(DT, 9:13), DT$country)
# PCGDP LIFEEX GINI ODA POP
# 1: 483.8351 49.19717 NA 1487548499 18362258.22
# 2: 2819.2400 71.68027 31.41111 312928126 2708297.17
# 3: 3532.2714 63.56290 34.36667 612238500 25305290.68
# 4: 10071.0659 NA NA NA 43115.10
# 5: 40083.0911 NA NA NA 51547.35
# ---
# 212: 35629.7336 73.71292 NA NA 92238.53
# 213: 2388.4348 71.60780 34.52500 1638581462 3312289.13
# 214: 1069.6596 52.53707 35.46667 859950996 13741375.82
# 215: 1318.8627 51.09263 52.68889 734624330 8614972.38
# 216: 1219.4360 54.53360 45.93333 397104997 9402160.33
# Or
g <- GRP(DT, "country")
add_vars(g[["groups"]], fmean(gv(DT, 9:13), g))
# country PCGDP LIFEEX GINI ODA POP
# 1: Afghanistan 483.8351 49.19717 NA 1487548499 18362258.22
# 2: Albania 2819.2400 71.68027 31.41111 312928126 2708297.17
# 3: Algeria 3532.2714 63.56290 34.36667 612238500 25305290.68
# 4: American Samoa 10071.0659 NA NA NA 43115.10
# 5: Andorra 40083.0911 NA NA NA 51547.35
# ---
# 212: Virgin Islands (U.S.) 35629.7336 73.71292 NA NA 92238.53
# 213: West Bank and Gaza 2388.4348 71.60780 34.52500 1638581462 3312289.13
# 214: Yemen, Rep. 1069.6596 52.53707 35.46667 859950996 13741375.82
# 215: Zambia 1318.8627 51.09263 52.68889 734624330 8614972.38
# 216: Zimbabwe 1219.4360 54.53360 45.93333 397104997 9402160.33
This gives us the same result as the high-level functions gby / fgroup_by or collap. It is, however, not what data.table is doing in DT[, lapply(.SD, fmean), by = country, .SDcols = 9:13]. Since fmean is not a function it recognizes and is able to optimize, it does something like this:
BY(gv(DT, 9:13), g, fmean) # using collapse::BY
# PCGDP LIFEEX GINI ODA POP
# 1: 483.8351 49.19717 NA 1487548499 18362258.22
# 2: 2819.2400 71.68027 31.41111 312928126 2708297.17
# 3: 3532.2714 63.56290 34.36667 612238500 25305290.68
# 4: 10071.0659 NA NA NA 43115.10
# 5: 40083.0911 NA NA NA 51547.35
# ---
# 212: 35629.7336 73.71292 NA NA 92238.53
# 213: 2388.4348 71.60780 34.52500 1638581462 3312289.13
# 214: 1069.6596 52.53707 35.46667 859950996 13741375.82
# 215: 1318.8627 51.09263 52.68889 734624330 8614972.38
# 216: 1219.4360 54.53360 45.93333 397104997 9402160.33
which applies fmean to every group in every column of the data.
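The contrast between a single grouped call and per-group application can be reproduced on a small vector. The following is only a sketch on made-up data (x, g and w are hypothetical); it also exercises the w and TRA arguments described earlier.

```r
library(collapse)

x <- c(1, 2, 3, 4, NA)
g <- c("a", "a", "b", "b", "b")
w <- c(1, 3, 1, 1, 1)

# One call: grouping is handled internally, missing values skipped by default
m_fast <- fmean(x, g = g)

# Split-apply-combine: mean() is called once per group
m_by <- BY(x, g, mean, na.rm = TRUE)

all.equal(unname(m_fast), unname(m_by))

# Weighted group means, and subtracting the group means from the data
m_w <- fmean(x, g = g, w = w)
x_c <- fmean(x, g = g, TRA = "-")
```

Both approaches return the same means; the difference is only in how often an R function has to be dispatched and called.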
More generally, it is very important to understand that collapse is not based around applying functions to data by groups using some universal mechanism: The dplyr data %>% group_by(...) %>% summarize(...) / mutate(...) and data.table [i, j, by] syntax are essentially universal mechanisms to apply any function to data by groups. data.table additionally internally optimizes some functions (min, max, mean, median, var, sd, sum, prod, first, last, head, tail), an optimization it calls GForce (see ?data.table::GForce). collapse instead provides grouped statistical and transformation functions where all grouped computation is done efficiently in C++, and some supporting mechanisms (fgroup_by, collap) to operate them. In data.table words, everything (see footnote 2) in collapse, the Fast Statistical Functions, data transformations, time series etc. is GForce optimized.
The full set of optimized grouped statistical and transformation functions in collapse is:
.FAST_FUN
# [1] "fmean" "fmedian" "fmode" "fsum" "fprod" "fsd" "fvar"
# [8] "fmin" "fmax" "fnth" "ffirst" "flast" "fnobs" "fndistinct"
# [15] "fcumsum" "fscale" "fbetween" "fwithin" "fhdbetween" "fhdwithin" "flag"
# [22] "fdiff" "fgrowth"
Additional optimized grouped functions include TRA, qsu, varying, fFtest, psmat, psacf, pspacf, psccf.
The nice thing about those GForce (fast) functions provided by collapse is that they can be accessed explicitly and programmatically, without the overhead incurred through data.table. They cover a broader range of statistical operations (such as mode, distinct values, order statistics), support sampling weights, and operate in a class-agnostic way on vectors, matrices, data frames and many related classes. They also cover transformations (replacing and sweeping, scaling, (higher order) centering, linear fitting) and time series functionality (lags, differences and growth rates, including irregular time series and unbalanced panels).
So if we wanted to use fmean inside the data.table, we should do something like this:
# This does not save the grouping columns, we are simply passing a grouping vector to g
# and aggregating the subset of the data table (.SD).
DT[, fmean(.SD, country), .SDcols = 9:13]
# PCGDP LIFEEX GINI ODA POP
# 1: 483.8351 49.19717 NA 1487548499 18362258.22
# 2: 2819.2400 71.68027 31.41111 312928126 2708297.17
# 3: 3532.2714 63.56290 34.36667 612238500 25305290.68
# 4: 10071.0659 NA NA NA 43115.10
# 5: 40083.0911 NA NA NA 51547.35
# ---
# 212: 35629.7336 73.71292 NA NA 92238.53
# 213: 2388.4348 71.60780 34.52500 1638581462 3312289.13
# 214: 1069.6596 52.53707 35.46667 859950996 13741375.82
# 215: 1318.8627 51.09263 52.68889 734624330 8614972.38
# 216: 1219.4360 54.53360 45.93333 397104997 9402160.33
# If we want to keep the grouping columns, we need to group .SD first.
DT[, fmean(gby(.SD, country)), .SDcols = c(1L, 9:13)]
# country PCGDP LIFEEX GINI ODA POP
# 1: Afghanistan 483.8351 49.19717 NA 1487548499 18362258.22
# 2: Albania 2819.2400 71.68027 31.41111 312928126 2708297.17
# 3: Algeria 3532.2714 63.56290 34.36667 612238500 25305290.68
# 4: American Samoa 10071.0659 NA NA NA 43115.10
# 5: Andorra 40083.0911 NA NA NA 51547.35
# ---
# 212: Virgin Islands (U.S.) 35629.7336 73.71292 NA NA 92238.53
# 213: West Bank and Gaza 2388.4348 71.60780 34.52500 1638581462 3312289.13
# 214: Yemen, Rep. 1069.6596 52.53707 35.46667 859950996 13741375.82
# 215: Zambia 1318.8627 51.09263 52.68889 734624330 8614972.38
# 216: Zimbabwe 1219.4360 54.53360 45.93333 397104997 9402160.33
Needless to say this kind of programming seems a bit arcane, so there is actually not that great of a scope to use collapse’s Fast Statistical Functions for aggregations inside data.table. I drive this point home with a benchmark:
microbenchmark(collapse = DT %>% gby(country) %>% get_vars(9:13) %>% fmean,
data.table = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
data.table_base = DT[, lapply(.SD, base::mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
hybrid_bad = DT[, lapply(.SD, fmean), keyby = country, .SDcols = 9:13],
hybrid_ok = DT[, fmean(gby(.SD, country)), .SDcols = c(1L, 9:13)])
# Unit: microseconds
# expr min lq mean median uq max neval
# collapse 197.251 214.1020 242.6109 228.2060 254.5690 512.910 100
# data.table 562.725 582.2205 672.5804 598.9075 620.9245 5628.849 100
# data.table_base 2447.987 2534.0050 2951.6318 2563.5045 2660.6335 8940.255 100
# hybrid_bad 2039.914 2090.5490 2302.6195 2119.7205 2196.8825 8774.246 100
# hybrid_ok 353.338 374.2890 466.1967 394.7070 423.3865 6232.164 100
It is evident that data.table has some overhead, so there is absolutely no need to do this kind of syntax manipulation.
There is more scope to use collapse transformation functions inside data.table.
Below are some basic examples:
# Computing a column containing the sum of ODA received by country
DT[, sum_ODA := sum(ODA, na.rm = TRUE), by = country]
# Same using fsum; "replace_fill" overwrites missing values, "replace" keeps them
DT[, sum_ODA := fsum(ODA, country, TRA = "replace_fill")]
# Same: A native collapse solution using settransform (or its shortcut form)
settfm(DT, sum_ODA = fsum(ODA, country, TRA = "replace_fill"))
# settfm may be more convenient than `:=` for multiple column modifications,
# each involving a different grouping:
# This computes ODA as a percentage of the total ODA received
# by each country over time, and of the total ODA received in a given year
settfm(DT, perc_c_ODA = fsum(ODA, country, TRA = "%"),
perc_y_ODA = fsum(ODA, year, TRA = "%"))
The TRA argument is available to all Fast Statistical Functions (see the macro .FAST_STAT_FUN) and offers 10 different replacing and sweeping operations. Note that TRA() can also be called directly to replace or sweep with a previously aggregated data.table. A set of operators %rr%, %r+%, %r-%, %r*%, %r/%, %cr%, %c+%, %c-%, %c*%, %c/% additionally facilitate row- or column-wise replacing or sweeping out vectors of statistics or other data.table’s.
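To make the direct use of TRA() concrete, here is a small sketch on made-up vectors (x and g are hypothetical). It sweeps previously computed group means out of the data and checks that this matches the TRA argument and fwithin. The same pattern should carry over to an aggregated data.table, which this sketch does not cover.

```r
library(collapse)

x <- c(1, 2, 3, 4)
g <- c("a", "a", "b", "b")

# Step 1: aggregate; step 2: sweep the statistics out by group
stats    <- fmean(x, g = g)
centered <- TRA(x, stats, "-", g)

# Equivalent one-step versions
all.equal(centered, fmean(x, g = g, TRA = "-"))
all.equal(centered, fwithin(x, g))
```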
Similarly, we can use the following vector valued functions
setdiff(.FAST_FUN, .FAST_STAT_FUN)
# [1] "fcumsum" "fscale" "fbetween" "fwithin" "fhdbetween" "fhdwithin" "flag"
# [8] "fdiff" "fgrowth"
for very efficient data transformations:
# Centering GDP
DT[, demean_PCGDP := PCGDP - mean(PCGDP, na.rm = TRUE), by = country]
DT[, demean_PCGDP := fwithin(PCGDP, country)]
# Lagging GDP
DT[order(year), lag_PCGDP := shift(PCGDP, 1L), by = country]
DT[, lag_PCGDP := flag(PCGDP, 1L, country, year)]
# Computing a growth rate
DT[order(year), growth_PCGDP := (PCGDP / shift(PCGDP, 1L) - 1) * 100, by = country]
DT[, growth_PCGDP := fgrowth(PCGDP, 1L, 1L, country, year)] # 1 lag, 1 iteration
# Several Growth rates
DT[order(year), paste0("growth_", .c(PCGDP, LIFEEX, GINI, ODA, POP)) := (.SD / shift(.SD, 1L) - 1) * 100,
by = country, .SDcols = 9:13]
# Same thing using collapse
DT %<>% tfm(gv(., 9:13) %>% fgrowth(1L, 1L, country, year) %>% add_stub("growth_"))
# Or even simpler using settransform and the Growth operator
settfmv(DT, 9:13, G, 1L, 1L, country, year, apply = FALSE)
head(DT)
# country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
# 1: Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA 32.446 NA 116769997
# 2: Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA 32.962 NA 232080002
# 3: Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA 33.471 NA 112839996
# 4: Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA 33.971 NA 237720001
# 5: Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA 34.463 NA 295920013
# 6: Afghanistan AFG 1966-01-01 1965 1960 South Asia Low income FALSE NA 34.948 NA 341839996
# POP sum_ODA perc_c_ODA perc_y_ODA demean_PCGDP lag_PCGDP growth_PCGDP growth_LIFEEX
# 1: 8996973 89252909923 0.1308305 0.4441407 NA NA NA NA
# 2: 9169410 89252909923 0.2600251 0.7356654 NA NA NA 1.590335
# 3: 9351441 89252909923 0.1264272 0.3494956 NA NA NA 1.544202
# 4: 9543205 89252909923 0.2663443 0.7003399 NA NA NA 1.493830
# 5: 9744781 89252909923 0.3315522 0.8570540 NA NA NA 1.448294
# 6: 9956320 89252909923 0.3830015 0.8992630 NA NA NA 1.407306
# growth_GINI growth_ODA growth_POP G1.PCGDP G1.LIFEEX G1.GINI G1.ODA G1.POP
# 1: NA NA NA NA NA NA NA NA
# 2: NA 98.74969 1.916611 NA 1.590335 NA 98.74969 1.916611
# 3: NA -51.37884 1.985199 NA 1.544202 NA -51.37884 1.985199
# 4: NA 110.66998 2.050636 NA 1.493830 NA 110.66998 2.050636
# 5: NA 24.48259 2.112246 NA 1.448294 NA 24.48259 2.112246
# 6: NA 15.51770 2.170793 NA 1.407306 NA 15.51770 2.170793
Since transformations (:= operations) are not highly optimized in data.table, collapse will be faster in most circumstances. Also time series functionality in collapse is significantly faster as it does not require data to be ordered or balanced to compute. For example flag computes an ordered lag without sorting the entire data first.
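A small sketch on a made-up, deliberately unsorted panel (the table panel and its columns are hypothetical) shows that flag, given a time variable, returns the same ordered lag as sorting with data.table first:

```r
library(collapse)
library(data.table)

# Rows are intentionally not ordered by time within id
panel <- data.table(id = c("a", "a", "a", "b", "b", "b"),
                    t  = c(3L, 1L, 2L, 2L, 3L, 1L),
                    x  = c(30, 10, 20, 200, 300, 100))

# collapse: ordered lag using the time variable, no sorting of the table
panel[, lag_fl := flag(x, 1L, id, t)]

# data.table: order rows by time, then shift within groups
panel[order(t), lag_dt := shift(x, 1L), by = id]

all.equal(panel$lag_fl, panel$lag_dt)
```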
# Let's generate a large dataset and benchmark this stuff
DT_large <- replicate(1000, qDT(wlddev), simplify = FALSE) %>%
lapply(tfm, country = paste(country, rnorm(1))) %>%
rbindlist
# 13.2 million Obs
fdim(DT_large)
# [1] 13176000 13
microbenchmark(
S1 = DT_large[, sum_ODA := sum(ODA, na.rm = TRUE), by = country],
S2 = DT_large[, sum_ODA := fsum(ODA, country, TRA = "replace_fill")],
S3 = settfm(DT_large, sum_ODA = fsum(ODA, country, TRA = "replace_fill")),
W1 = DT_large[, demean_PCGDP := PCGDP - mean(PCGDP, na.rm = TRUE), by = country],
W2 = DT_large[, demean_PCGDP := fwithin(PCGDP, country)],
L1 = DT_large[order(year), lag_PCGDP := shift(PCGDP, 1L), by = country],
L2 = DT_large[, lag_PCGDP := flag(PCGDP, 1L, country, year)],
L3 = DT_large[, lag_PCGDP := shift(PCGDP, 1L), by = country], # Not ordered
L4 = DT_large[, lag_PCGDP := flag(PCGDP, 1L, country)], # Not ordered
times = 5
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# S1 289.65196 291.61980 310.36664 294.02760 312.65657 363.87725 5
# S2 97.49447 97.66614 114.25686 106.31001 131.33264 138.48107 5
# S3 96.41453 96.96201 97.64662 98.20955 98.24121 98.40578 5
# W1 861.21324 867.87144 883.36677 880.42527 884.60796 922.71595 5
# W2 101.27521 102.30845 112.87029 105.77282 122.77438 132.22061 5
# L1 2413.53921 2423.93374 2466.20643 2481.59011 2493.74759 2518.22152 5
# L2 115.12546 115.65313 139.44138 132.54923 136.48551 197.39356 5
# L3 1436.76694 1441.97611 1455.07418 1451.64998 1469.97181 1475.00608 5
# L4 82.27737 86.86567 106.38026 93.75946 109.72978 159.26901 5
rm(DT_large)
gc()
# used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
# Ncells 1057050 56.5 2029794 108.5 NA 2029794 108.5
# Vcells 2305758 17.6 273622651 2087.6 16384 301840586 2302.9
As mentioned, qDT is a flexible and very fast function to create / column-wise convert R objects to data.table’s. You can also row-wise convert a matrix to data.table using mrtl:
# Creating a matrix from mtcars
m <- qM(mtcars)
str(m)
# num [1:32, 1:11] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
# ..$ : chr [1:11] "mpg" "cyl" "disp" "hp" ...
# Demonstrating another nice feature of qDT
qDT(m, row.names.col = "car") %>% head
# car mpg cyl disp hp drat wt qsec vs am gear carb
# 1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# 2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# 3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# 4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# 5: Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# 6: Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Row-wise conversion to data.table
mrtl(m, names = TRUE, return = "data.table") %>% head(2)
# Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360 Merc 240D
# 1: 21 21 22.8 21.4 18.7 18.1 14.3 24.4
# 2: 6 6 4.0 6.0 8.0 6.0 8.0 4.0
# Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC Cadillac Fleetwood
# 1: 22.8 19.2 17.8 16.4 17.3 15.2 10.4
# 2: 4.0 6.0 6.0 8.0 8.0 8.0 8.0
# Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
# 1: 10.4 14.7 32.4 30.4 33.9 21.5
# 2: 8.0 8.0 4.0 4.0 4.0 4.0
# Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
# 1: 15.5 15.2 13.3 19.2 27.3 26 30.4
# 2: 8.0 8.0 8.0 8.0 4.0 4 4.0
# Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
# 1: 15.8 19.7 15 21.4
# 2: 8.0 6.0 8 4.0
The computational efficiency of these functions makes them very useful to use in data.table based workflows.
# Benchmark
microbenchmark(qDT(m, "car"), mrtl(m, TRUE, "data.table"))
# Unit: microseconds
# expr min lq mean median uq max neval
# qDT(m, "car") 4.797 5.1045 6.24020 5.5965 6.027 51.783 100
# mrtl(m, TRUE, "data.table") 3.444 3.6080 3.94092 3.7515 3.977 16.769 100
For example, we could regress the growth rate of GDP per capita on the growth rate of life expectancy in each country and save the results in a data.table:
library(lmtest)
wlddev %>% fselect(country, PCGDP, LIFEEX) %>%
# This removes rows with missing values, checking PCGDP and LIFEEX only
na_omit(cols = -1L) %>%
# This removes countries with 20 or fewer observations
fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>%
qDT %>%
# Run estimations by country using data.table
.[, qDT(coeftest(lm(G(PCGDP) ~ G(LIFEEX))), "Coef"), keyby = country] %>% head
# country Coef Estimate Std. Error t value Pr(>|t|)
# 1: Albania (Intercept) -3.6146411 2.371885 -1.5239527 0.136023086
# 2: Albania G(LIFEEX) 22.1596308 7.288971 3.0401591 0.004325856
# 3: Algeria (Intercept) 0.5973329 1.740619 0.3431726 0.732731107
# 4: Algeria G(LIFEEX) 0.8412547 1.689221 0.4980134 0.620390703
# 5: Angola (Intercept) -3.3793976 1.540330 -2.1939445 0.034597175
# 6: Angola G(LIFEEX) 4.2362895 1.402380 3.0207852 0.004553260
If we only need the coefficients, not the standard errors, we can also use collapse::flm together with mrtl:
wlddev %>% fselect(country, PCGDP, LIFEEX) %>%
na_omit(cols = -1L) %>%
fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>%
qDT %>%
.[, mrtl(flm(fgrowth(PCGDP)[-1L],
cbind(Intercept = 1,
LIFEEX = fgrowth(LIFEEX)[-1L])), TRUE),
keyby = country] %>% head
# country Intercept LIFEEX
# 1: Albania -3.61464113 22.1596308
# 2: Algeria 0.59733291 0.8412547
# 3: Angola -3.37939760 4.2362895
# 4: Antigua and Barbuda -3.11880717 18.8700870
# 5: Argentina 1.14613567 -0.2896305
# 6: Armenia 0.08178344 11.5523992
… which provides a significant speed gain here:
microbenchmark(
A = wlddev %>% fselect(country, PCGDP, LIFEEX) %>%
na_omit(cols = -1L) %>%
fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>%
qDT %>%
.[, qDT(coeftest(lm(G(PCGDP) ~ G(LIFEEX))), "Coef"), keyby = country],
B = wlddev %>% fselect(country, PCGDP, LIFEEX) %>%
na_omit(cols = -1L) %>%
fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>%
qDT %>%
.[, mrtl(flm(fgrowth(PCGDP)[-1L],
cbind(Intercept = 1,
LIFEEX = fgrowth(LIFEEX)[-1L])), TRUE),
keyby = country]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# A 57.866457 60.079780 64.135896 62.003111 67.19277 99.328240 100
# B 3.121699 3.251567 3.738713 3.340618 3.65722 9.700887 100
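That flm recovers the same coefficients as lm can be checked on simulated data. This is a sketch under assumed data (x, y and the design matrix X are made up for illustration), not part of the benchmark above.

```r
library(collapse)

set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)

# flm takes the response and a design matrix; the intercept is added by hand
X <- cbind(Intercept = 1, x = x)
b_flm <- drop(flm(y, X))

# Reference fit with lm
b_lm <- coef(lm(y ~ x))

all.equal(unname(b_flm), unname(b_lm))
```

The speed gain comes from skipping the formula interface, model frame construction and inference that lm and coeftest perform on every call.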
Other features to highlight at this point are collapse’s list processing functions, in particular rsplit, rapply2d, get_elem and unlist2d. rsplit is an efficient recursive generalization of split:
DT_list <- rsplit(DT, country + year + PCGDP + LIFEEX ~ region + income)
# Note: rsplit(DT, year + PCGDP + LIFEEX ~ region + income, flatten = TRUE)
# would yield a simple list with interacted categories (like split)
str(DT_list, give.attr = FALSE)
# List of 7
# $ East Asia & Pacific :List of 3
# ..$ High income :Classes 'data.table' and 'data.frame': 793 obs. of 4 variables:
# .. ..$ country: chr [1:793] "Australia" "Australia" "Australia" "Australia" ...
# .. ..$ year : int [1:793] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:793] 19378 19469 19246 20053 21036 ...
# .. ..$ LIFEEX : num [1:793] 70.8 71 70.9 70.9 70.9 ...
# ..$ Lower middle income:Classes 'data.table' and 'data.frame': 793 obs. of 4 variables:
# .. ..$ country: chr [1:793] "Cambodia" "Cambodia" "Cambodia" "Cambodia" ...
# .. ..$ year : int [1:793] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:793] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:793] 41.2 41.4 41.5 41.7 41.9 ...
# ..$ Upper middle income:Classes 'data.table' and 'data.frame': 610 obs. of 4 variables:
# .. ..$ country: chr [1:610] "American Samoa" "American Samoa" "American Samoa" "American Samoa" ...
# .. ..$ year : int [1:610] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:610] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:610] NA NA NA NA NA NA NA NA NA NA ...
# $ Europe & Central Asia :List of 4
# ..$ High income :Classes 'data.table' and 'data.frame': 2257 obs. of 4 variables:
# .. ..$ country: chr [1:2257] "Andorra" "Andorra" "Andorra" "Andorra" ...
# .. ..$ year : int [1:2257] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:2257] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:2257] NA NA NA NA NA NA NA NA NA NA ...
# ..$ Low income :Classes 'data.table' and 'data.frame': 61 obs. of 4 variables:
# .. ..$ country: chr [1:61] "Tajikistan" "Tajikistan" "Tajikistan" "Tajikistan" ...
# .. ..$ year : int [1:61] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:61] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:61] 50.6 50.9 51.2 51.5 51.9 ...
# ..$ Lower middle income:Classes 'data.table' and 'data.frame': 244 obs. of 4 variables:
# .. ..$ country: chr [1:244] "Kyrgyz Republic" "Kyrgyz Republic" "Kyrgyz Republic" "Kyrgyz Republic" ...
# .. ..$ year : int [1:244] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:244] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:244] 56.1 56.6 57 57.4 57.9 ...
# ..$ Upper middle income:Classes 'data.table' and 'data.frame': 976 obs. of 4 variables:
# .. ..$ country: chr [1:976] "Albania" "Albania" "Albania" "Albania" ...
# .. ..$ year : int [1:976] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:976] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:976] 62.3 63.3 64.2 64.9 65.5 ...
# $ Latin America & Caribbean :List of 4
# ..$ High income :Classes 'data.table' and 'data.frame': 1037 obs. of 4 variables:
# .. ..$ country: chr [1:1037] "Antigua and Barbuda" "Antigua and Barbuda" "Antigua and Barbuda" "Antigua and Barbuda" ...
# .. ..$ year : int [1:1037] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:1037] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:1037] 62 62.5 63 63.5 64 ...
# ..$ Low income :Classes 'data.table' and 'data.frame': 61 obs. of 4 variables:
# .. ..$ country: chr [1:61] "Haiti" "Haiti" "Haiti" "Haiti" ...
# .. ..$ year : int [1:61] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:61] 1512 1439 1523 1466 1414 ...
# .. ..$ LIFEEX : num [1:61] 41.8 42.2 42.6 43 43.4 ...
# ..$ Lower middle income:Classes 'data.table' and 'data.frame': 244 obs. of 4 variables:
# .. ..$ country: chr [1:244] "Bolivia" "Bolivia" "Bolivia" "Bolivia" ...
# .. ..$ year : int [1:244] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:244] 1005 1007 1042 1091 1112 ...
# .. ..$ LIFEEX : num [1:244] 41.8 42.1 42.5 42.8 43.2 ...
# ..$ Upper middle income:Classes 'data.table' and 'data.frame': 1220 obs. of 4 variables:
# .. ..$ country: chr [1:1220] "Argentina" "Argentina" "Argentina" "Argentina" ...
# .. ..$ year : int [1:1220] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:1220] 5643 5853 5711 5323 5773 ...
# .. ..$ LIFEEX : num [1:1220] 65.1 65.2 65.3 65.3 65.4 ...
# $ Middle East & North Africa:List of 4
# ..$ High income :Classes 'data.table' and 'data.frame': 488 obs. of 4 variables:
# .. ..$ country: chr [1:488] "Bahrain" "Bahrain" "Bahrain" "Bahrain" ...
# .. ..$ year : int [1:488] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:488] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:488] 51.9 53.2 54.6 55.9 57.2 ...
# ..$ Low income :Classes 'data.table' and 'data.frame': 122 obs. of 4 variables:
# .. ..$ country: chr [1:122] "Syrian Arab Republic" "Syrian Arab Republic" "Syrian Arab Republic" "Syrian Arab Republic" ...
# .. ..$ year : int [1:122] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:122] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:122] 52 52.6 53.2 53.8 54.4 ...
# ..$ Lower middle income:Classes 'data.table' and 'data.frame': 305 obs. of 4 variables:
# .. ..$ country: chr [1:305] "Djibouti" "Djibouti" "Djibouti" "Djibouti" ...
# .. ..$ year : int [1:305] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:305] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:305] 44 44.5 44.9 45.3 45.7 ...
# ..$ Upper middle income:Classes 'data.table' and 'data.frame': 366 obs. of 4 variables:
# .. ..$ country: chr [1:366] "Algeria" "Algeria" "Algeria" "Algeria" ...
# .. ..$ year : int [1:366] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:366] 2481 2091 1638 2146 2214 ...
# .. ..$ LIFEEX : num [1:366] 46.1 46.6 47.1 47.5 48 ...
# $ North America :List of 1
# ..$ High income:Classes 'data.table' and 'data.frame': 183 obs. of 4 variables:
# .. ..$ country: chr [1:183] "Bermuda" "Bermuda" "Bermuda" "Bermuda" ...
# .. ..$ year : int [1:183] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:183] 33363 34080 34763 34324 37202 ...
# .. ..$ LIFEEX : num [1:183] NA NA NA NA NA ...
# $ South Asia :List of 3
# ..$ Low income :Classes 'data.table' and 'data.frame': 122 obs. of 4 variables:
# .. ..$ country: chr [1:122] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
# .. ..$ year : int [1:122] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:122] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:122] 32.4 33 33.5 34 34.5 ...
# ..$ Lower middle income:Classes 'data.table' and 'data.frame': 244 obs. of 4 variables:
# .. ..$ country: chr [1:244] "Bangladesh" "Bangladesh" "Bangladesh" "Bangladesh" ...
# .. ..$ year : int [1:244] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:244] 372 384 394 381 411 ...
# .. ..$ LIFEEX : num [1:244] 45.4 46 46.6 47.1 47.6 ...
# ..$ Upper middle income:Classes 'data.table' and 'data.frame': 122 obs. of 4 variables:
# .. ..$ country: chr [1:122] "Maldives" "Maldives" "Maldives" "Maldives" ...
# .. ..$ year : int [1:122] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:122] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:122] 37.3 37.9 38.6 39.2 39.9 ...
# $ Sub-Saharan Africa :List of 4
# ..$ High income :Classes 'data.table' and 'data.frame': 61 obs. of 4 variables:
# .. ..$ country: chr [1:61] "Seychelles" "Seychelles" "Seychelles" "Seychelles" ...
# .. ..$ year : int [1:61] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:61] 2830 2617 2763 2966 3064 ...
# .. ..$ LIFEEX : num [1:61] NA NA NA NA NA NA NA NA NA NA ...
# ..$ Low income :Classes 'data.table' and 'data.frame': 1464 obs. of 4 variables:
# .. ..$ country: chr [1:1464] "Benin" "Benin" "Benin" "Benin" ...
# .. ..$ year : int [1:1464] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:1464] 712 724 689 710 745 ...
# .. ..$ LIFEEX : num [1:1464] 37.3 37.7 38.2 38.7 39.1 ...
# ..$ Lower middle income:Classes 'data.table' and 'data.frame': 1037 obs. of 4 variables:
# .. ..$ country: chr [1:1037] "Angola" "Angola" "Angola" "Angola" ...
# .. ..$ year : int [1:1037] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:1037] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ LIFEEX : num [1:1037] 37.5 37.8 38.1 38.4 38.8 ...
# ..$ Upper middle income:Classes 'data.table' and 'data.frame': 366 obs. of 4 variables:
# .. ..$ country: chr [1:366] "Botswana" "Botswana" "Botswana" "Botswana" ...
# .. ..$ year : int [1:366] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
# .. ..$ PCGDP : num [1:366] 408 425 444 460 480 ...
# .. ..$ LIFEEX : num [1:366] 49.2 49.7 50.2 50.6 51.1 ...
We can use rapply2d to apply a function to each data frame / data.table in an arbitrary nested structure:
# This runs region-income level regressions, with country fixed effects
# following Mundlak (1978)
lm_summary_list <- DT_list %>%
  rapply2d(lm, formula = G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)) %>%
  # Summarizing the results
  rapply2d(summary, classes = "lm")
# This is a nested list of linear model summaries
str(lm_summary_list, give.attr = FALSE)
# List of 7
# $ East Asia & Pacific :List of 3
# ..$ High income :List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:441] -1.64 -2.59 2.75 3.45 2.48 ...
# .. ..$ coefficients : num [1:3, 1:4] 0.531 2.494 3.83 0.706 0.759 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 4.59
# .. ..$ df : int [1:3] 3 438 3
# .. ..$ r.squared : num 0.0525
# .. ..$ adj.r.squared: num 0.0481
# .. ..$ fstatistic : Named num [1:3] 12.1 2 438
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.02361 -0.00158 -0.04895 -0.00158 0.02728 ...
# .. ..$ na.action : 'omit' Named int [1:352] 1 61 62 63 64 65 66 67 68 69 ...
# ..$ Lower middle income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:549] -39.6968 3.6618 -0.0944 -1.8261 -1.0491 ...
# .. ..$ coefficients : num [1:3, 1:4] 1.348 0.524 0.949 0.701 0.757 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 5.4
# .. ..$ df : int [1:3] 3 546 3
# .. ..$ r.squared : num 0.00471
# .. ..$ adj.r.squared: num 0.00106
# .. ..$ fstatistic : Named num [1:3] 1.29 2 546
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.016821 0.000511 -0.022767 0.000511 0.01965 ...
# .. ..$ na.action : 'omit' Named int [1:244] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Upper middle income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:312] -32.29 -11.61 2.91 11.23 10.28 ...
# .. ..$ coefficients : num [1:3, 1:4] 1.507 -0.547 4.816 0.428 0.478 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 4.39
# .. ..$ df : int [1:3] 3 309 3
# .. ..$ r.squared : num 0.103
# .. ..$ adj.r.squared: num 0.0976
# .. ..$ fstatistic : Named num [1:3] 17.8 2 309
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.009471 0.000492 -0.013551 0.000492 0.011842 ...
# .. ..$ na.action : 'omit' Named int [1:298] 1 2 3 4 5 6 7 8 9 10 ...
# $ Europe & Central Asia :List of 4
# ..$ High income :List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:1355] 2.706 -0.548 1.001 3.034 0.257 ...
# .. ..$ coefficients : num [1:3, 1:4] 3.254 -0.172 -2.506 0.407 0.227 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 3.3
# .. ..$ df : int [1:3] 3 1352 3
# .. ..$ r.squared : num 0.00257
# .. ..$ adj.r.squared: num 0.00109
# .. ..$ fstatistic : Named num [1:3] 1.74 2 1352
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.015254 -0.000863 -0.05461 -0.000863 0.004722 ...
# .. ..$ na.action : 'omit' Named int [1:902] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Low income :List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:34] 0.166 -1.804 15.949 -0.778 7.165 ...
# .. ..$ coefficients : num [1:2, 1:4] -5.31 9.36 2.03 2.56 -2.61 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE TRUE
# .. ..$ sigma : num 8.43
# .. ..$ df : int [1:3] 2 32 3
# .. ..$ r.squared : num 0.295
# .. ..$ adj.r.squared: num 0.273
# .. ..$ fstatistic : Named num [1:3] 13.4 1 32
# .. ..$ cov.unscaled : num [1:2, 1:2] 0.0582 -0.0514 -0.0514 0.092
# .. ..$ na.action : 'omit' Named int [1:27] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Lower middle income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:121] -1.626 8.745 -14.47 0.298 -11.886 ...
# .. ..$ coefficients : num [1:3, 1:4] 0.106 4.631 1.499 1.315 0.938 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 6.02
# .. ..$ df : int [1:3] 3 118 3
# .. ..$ r.squared : num 0.178
# .. ..$ adj.r.squared: num 0.164
# .. ..$ fstatistic : Named num [1:3] 12.7 2 118
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.047775 -0.000927 -0.142782 -0.000927 0.024298 ...
# .. ..$ na.action : 'omit' Named int [1:123] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Upper middle income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:511] 0.761 -2.153 -4.091 -6.476 -3.43 ...
# .. ..$ coefficients : num [1:3, 1:4] 2.983 4.147 -3.351 0.698 0.779 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 8.28
# .. ..$ df : int [1:3] 3 508 3
# .. ..$ r.squared : num 0.0531
# .. ..$ adj.r.squared: num 0.0493
# .. ..$ fstatistic : Named num [1:3] 14.2 2 508
# .. ..$ cov.unscaled : num [1:3, 1:3] 7.11e-03 4.52e-05 -1.45e-02 4.52e-05 8.85e-03 ...
# .. ..$ na.action : 'omit' Named int [1:465] 1 2 3 4 5 6 7 8 9 10 ...
# $ Latin America & Caribbean :List of 4
# ..$ High income :List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:487] 2.39 6.02 6.1 1.71 -2.27 ...
# .. ..$ coefficients : num [1:3, 1:4] 1.015 0.483 2.613 0.677 0.952 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 4.71
# .. ..$ df : int [1:3] 3 484 3
# .. ..$ r.squared : num 0.00592
# .. ..$ adj.r.squared: num 0.00181
# .. ..$ fstatistic : Named num [1:3] 1.44 2 484
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.02062 0.00155 -0.05714 0.00155 0.04082 ...
# .. ..$ na.action : 'omit' Named int [1:550] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Low income :List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:59] -5.667 5.091 -4.46 -4.224 -0.526 ...
# .. ..$ coefficients : num [1:2, 1:4] -3.18 4.02 1.73 2.28 -1.83 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE TRUE
# .. ..$ sigma : num 3.79
# .. ..$ df : int [1:3] 2 57 3
# .. ..$ r.squared : num 0.0516
# .. ..$ adj.r.squared: num 0.0349
# .. ..$ fstatistic : Named num [1:3] 3.1 1 57
# .. ..$ cov.unscaled : num [1:2, 1:2] 0.209 -0.265 -0.265 0.364
# .. ..$ na.action : 'omit' Named int [1:2] 1 61
# ..$ Lower middle income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:231] -1.386 2.029 3.213 0.413 1.334 ...
# .. ..$ coefficients : num [1:3, 1:4] -1.678 -0.479 3.896 2.26 0.709 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 3.96
# .. ..$ df : int [1:3] 3 228 3
# .. ..$ r.squared : num 0.0081
# .. ..$ adj.r.squared: num -0.000602
# .. ..$ fstatistic : Named num [1:3] 0.931 2 228
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.3264 0.005 -0.4084 0.005 0.0321 ...
# .. ..$ na.action : 'omit' Named int [1:13] 1 61 62 63 64 65 66 67 122 123 ...
# ..$ Upper middle income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:1065] 1.97 -4.16 -8.5 6.72 7.17 ...
# .. ..$ coefficients : num [1:3, 1:4] 1.681 0.583 -0.124 0.353 0.512 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 4.22
# .. ..$ df : int [1:3] 3 1062 3
# .. ..$ r.squared : num 0.0016
# .. ..$ adj.r.squared: num -0.000283
# .. ..$ fstatistic : Named num [1:3] 0.85 2 1062
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.006982 0.000348 -0.013936 0.000348 0.014734 ...
# .. ..$ na.action : 'omit' Named int [1:155] 1 61 62 122 123 183 184 244 245 305 ...
# $ Middle East & North Africa:List of 4
# ..$ High income :List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:334] -10.728 -11.988 2.151 0.985 -8.618 ...
# .. ..$ coefficients : num [1:3, 1:4] 1.929 3.963 -3.533 1.102 0.996 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 8.36
# .. ..$ df : int [1:3] 3 331 3
# .. ..$ r.squared : num 0.0456
# .. ..$ adj.r.squared: num 0.0399
# .. ..$ fstatistic : Named num [1:3] 7.91 2 331
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.01738 0.00101 -0.02441 0.00101 0.01419 ...
# .. ..$ na.action : 'omit' Named int [1:154] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Low income :List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:29] 0.468 3.424 0.415 3.842 3.342 ...
# .. ..$ coefficients : num [1:2, 1:4] -6.91 11.38 2.11 3.64 -3.27 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE TRUE
# .. ..$ sigma : num 6.05
# .. ..$ df : int [1:3] 2 27 3
# .. ..$ r.squared : num 0.266
# .. ..$ adj.r.squared: num 0.239
# .. ..$ fstatistic : Named num [1:3] 9.81 1 27
# .. ..$ cov.unscaled : num [1:2, 1:2] 0.122 -0.178 -0.178 0.361
# .. ..$ na.action : 'omit' Named int [1:93] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Lower middle income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:191] -0.95 -2.047 4.541 5.594 -0.723 ...
# .. ..$ coefficients : num [1:3, 1:4] 2.238 1.271 -0.647 1.002 0.599 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 3.94
# .. ..$ df : int [1:3] 3 188 3
# .. ..$ r.squared : num 0.0244
# .. ..$ adj.r.squared: num 0.014
# .. ..$ fstatistic : Named num [1:3] 2.35 2 188
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.06471 -0.00043 -0.07801 -0.00043 0.02309 ...
# .. ..$ na.action : 'omit' Named int [1:114] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Upper middle income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:263] -18.068 -23.976 28.692 0.858 1.141 ...
# .. ..$ coefficients : num [1:3, 1:4] 2.663 0.718 -1.19 3.538 1.318 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 13.8
# .. ..$ df : int [1:3] 3 260 3
# .. ..$ r.squared : num 0.00119
# .. ..$ adj.r.squared: num -0.00649
# .. ..$ fstatistic : Named num [1:3] 0.155 2 260
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.065741 0.000795 -0.084456 0.000795 0.009122 ...
# .. ..$ na.action : 'omit' Named int [1:103] 1 61 62 122 123 124 125 126 127 128 ...
# $ North America :List of 1
# ..$ High income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:137] 4.6986 -3.1098 1.8243 0.5643 0.0176 ...
# .. ..$ coefficients : num [1:3, 1:4] 6.542 -1.461 -19.53 2.272 0.662 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 2.49
# .. ..$ df : int [1:3] 3 134 3
# .. ..$ r.squared : num 0.0657
# .. ..$ adj.r.squared: num 0.0518
# .. ..$ fstatistic : Named num [1:3] 4.71 2 134
# .. ..$ cov.unscaled : num [1:3, 1:3] 8.36e-01 1.59e-17 -3.60 1.59e-17 7.10e-02 ...
# .. ..$ na.action : 'omit' Named int [1:46] 1 2 3 4 5 6 7 8 9 10 ...
# $ South Asia :List of 3
# ..$ Low income :List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:76] 0.544 -6.17 3.951 -0.964 7.829 ...
# .. ..$ coefficients : num [1:3, 1:4] -108.62 -1.72 96.06 174.19 1.25 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 3.7
# .. ..$ df : int [1:3] 3 73 3
# .. ..$ r.squared : num 0.0494
# .. ..$ adj.r.squared: num 0.0233
# .. ..$ fstatistic : Named num [1:3] 1.9 2 73
# .. ..$ cov.unscaled : num [1:3, 1:3] 2210.639 -6.979 -1875.261 -6.979 0.114 ...
# .. ..$ na.action : 'omit' Named int [1:46] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Lower middle income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:216] 0.294 -0.293 -6.067 4.954 -4.164 ...
# .. ..$ coefficients : num [1:3, 1:4] -2.232 0.238 5.972 1.074 0.493 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 3.44
# .. ..$ df : int [1:3] 3 213 3
# .. ..$ r.squared : num 0.111
# .. ..$ adj.r.squared: num 0.103
# .. ..$ fstatistic : Named num [1:3] 13.3 2 213
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.09757 -0.00201 -0.10483 -0.00201 0.02054 ...
# .. ..$ na.action : 'omit' Named int [1:28] 1 61 62 63 64 65 66 67 68 69 ...
# ..$ Upper middle income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:82] 3.262 3.976 3.128 1.67 -0.901 ...
# .. ..$ coefficients : num [1:3, 1:4] 3.859 -0.577 -0.476 1.036 1.365 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 4.25
# .. ..$ df : int [1:3] 3 79 3
# .. ..$ r.squared : num 0.00622
# .. ..$ adj.r.squared: num -0.0189
# .. ..$ fstatistic : Named num [1:3] 0.247 2 79
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.0595 -0.028 -0.0473 -0.028 0.1034 ...
# .. ..$ na.action : 'omit' Named int [1:40] 1 2 3 4 5 6 7 8 9 10 ...
# $ Sub-Saharan Africa :List of 4
# ..$ High income :List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:39] -11.33 -5.041 -3.158 0.585 7.81 ...
# .. ..$ coefficients : num [1:2, 1:4] 2.551 -0.644 0.775 0.55 3.293 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE TRUE
# .. ..$ sigma : num 4.8
# .. ..$ df : int [1:3] 2 37 3
# .. ..$ r.squared : num 0.0357
# .. ..$ adj.r.squared: num 0.00959
# .. ..$ fstatistic : Named num [1:3] 1.37 1 37
# .. ..$ cov.unscaled : num [1:2, 1:2] 0.026 -0.00217 -0.00217 0.01312
# .. ..$ na.action : 'omit' Named int [1:22] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Low income :List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:1085] 0.694 -5.869 2.069 3.855 2.415 ...
# .. ..$ coefficients : num [1:3, 1:4] -0.0756 0.5308 0.5124 0.8887 0.137 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 5.88
# .. ..$ df : int [1:3] 3 1082 3
# .. ..$ r.squared : num 0.0146
# .. ..$ adj.r.squared: num 0.0128
# .. ..$ fstatistic : Named num [1:3] 8.01 2 1082
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.022858 -0.000025 -0.025534 -0.000025 0.000543 ...
# .. ..$ na.action : 'omit' Named int [1:379] 1 61 62 122 123 183 184 244 245 305 ...
# ..$ Lower middle income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:891] -8.2839 -4.0289 0.0449 1.8231 -0.5267 ...
# .. ..$ coefficients : num [1:3, 1:4] 2.352 0.782 -2.616 0.608 0.169 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 5.27
# .. ..$ df : int [1:3] 3 888 3
# .. ..$ r.squared : num 0.0277
# .. ..$ adj.r.squared: num 0.0255
# .. ..$ fstatistic : Named num [1:3] 12.7 2 888
# .. ..$ cov.unscaled : num [1:3, 1:3] 1.33e-02 -1.13e-05 -2.00e-02 -1.13e-05 1.02e-03 ...
# .. ..$ na.action : 'omit' Named int [1:146] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Upper middle income:List of 12
# .. ..$ call : language FUN(formula = ..1, data = y)
# .. ..$ terms :Classes 'terms', 'formula' language G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)
# .. ..$ residuals : Named num [1:298] 0.7659 0.9133 0.0921 0.996 0.0765 ...
# .. ..$ coefficients : num [1:3, 1:4] 0.584 0.456 4.112 2.472 0.652 ...
# .. ..$ aliased : Named logi [1:3] FALSE FALSE FALSE
# .. ..$ sigma : num 11.4
# .. ..$ df : int [1:3] 3 295 3
# .. ..$ r.squared : num 0.00658
# .. ..$ adj.r.squared: num -0.000152
# .. ..$ fstatistic : Named num [1:3] 0.977 2 295
# .. ..$ cov.unscaled : num [1:3, 1:3] 0.047213 0.000438 -0.070778 0.000438 0.003285 ...
# .. ..$ na.action : 'omit' Named int [1:68] 1 61 62 63 64 65 66 67 68 69 ...
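To see what rapply2d does on a smaller scale, here is a minimal sketch on a hypothetical two-level list built from the built-in mtcars data. The list `nested`, its element names and the formula `mpg ~ wt` are illustrative assumptions and not part of the analysis above:

```r
library(collapse)

# Hypothetical nested list of data frames (slices of the built-in mtcars data)
nested <- list(
  a = list(x = mtcars[1:10, ], y = mtcars[11:20, ]),
  b = list(z = mtcars[21:32, ])
)

# rapply2d() recurses through the list and calls lm(<data frame>, formula = ...)
# on every data frame it finds; the nesting and the names are preserved
fits <- rapply2d(nested, lm, formula = mpg ~ wt)

# With classes = "lm", the fitted models are treated as the leaves to summarize
sums <- rapply2d(fits, summary, classes = "lm")

names(fits)      # top-level names of the result
class(sums$a$x)  # class of one leaf
```

Because the data frame is passed as the first unnamed argument and `formula` is named, each data frame ends up in the `data` argument of `lm`, which is why the calls in the output above read `FUN(formula = ..1, data = y)`.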
We can turn this list into a data.table again by first calling get_elem to recursively extract the coefficient matrices and then unlist2d to recursively bind them into a new data.table:
lm_summary_list %>%
  get_elem("coefficients") %>%
  unlist2d(idcols = .c(Region, Income),
           row.names = "Coef",
           DT = TRUE) %>% head
# Region Income Coef Estimate Std. Error t value
# 1: East Asia & Pacific High income (Intercept) 0.5313479 0.7058550 0.7527720
# 2: East Asia & Pacific High income G(LIFEEX) 2.4935584 0.7586943 3.2866443
# 3: East Asia & Pacific High income B(G(LIFEEX), country) 3.8297123 1.6916770 2.2638554
# 4: East Asia & Pacific Lower middle income (Intercept) 1.3476602 0.7008556 1.9228785
# 5: East Asia & Pacific Lower middle income G(LIFEEX) 0.5238856 0.7574904 0.6916069
# 6: East Asia & Pacific Lower middle income B(G(LIFEEX), country) 0.9494439 1.2031228 0.7891496
# Pr(>|t|)
# 1: 0.451991327
# 2: 0.001095466
# 3: 0.024071386
# 4: 0.055015131
# 5: 0.489478164
# 6: 0.430367103
The fact that this is a nested list of matrices, and that we can save both the names of the lists at each level of nesting and the row and column names of the matrices, makes unlist2d a significant generalization of rbindlist.³
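The binding step can be illustrated in isolation. The following is a small sketch on a hypothetical nested list of named matrices; the object names and the id column names "Outer", "Inner" and "Row" are made up for this example:

```r
library(collapse)

# Hypothetical 2 x 2 matrix with row and column names
m <- matrix(c(1, 2, 3, 4), nrow = 2,
            dimnames = list(c("r1", "r2"), c("A", "B")))

# Two levels of nesting, three matrices in total
nested <- list(g1 = list(h1 = m, h2 = m * 10),
               g2 = list(h1 = m * 100))

# List names at each level become id columns, matrix row names become "Row"
res <- unlist2d(nested, idcols = c("Outer", "Inner"),
                row.names = "Row", DT = TRUE)
res
```

A flat list of data frames could be bound with rbindlist and a single id column; the point of the sketch is that unlist2d keeps one id column per nesting level and also retains the row names.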
But why go to all this fuss if we could simply have done the following?
DT[, qDT(coeftest(lm(G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country))), "Coef"),
   keyby = .(region, income)] %>% head
# region income Coef Estimate Std. Error t value
# 1: East Asia & Pacific High income (Intercept) 0.5313479 0.7058550 0.7527720
# 2: East Asia & Pacific High income G(LIFEEX) 2.4935584 0.7586943 3.2866443
# 3: East Asia & Pacific High income B(G(LIFEEX), country) 3.8297123 1.6916770 2.2638554
# 4: East Asia & Pacific Lower middle income (Intercept) 1.3476602 0.7008556 1.9228785
# 5: East Asia & Pacific Lower middle income G(LIFEEX) 0.5238856 0.7574904 0.6916069
# 6: East Asia & Pacific Lower middle income B(G(LIFEEX), country) 0.9494439 1.2031228 0.7891496
# Pr(>|t|)
# 1: 0.451991327
# 2: 0.001095466
# 3: 0.024071386
# 4: 0.055015131
# 5: 0.489478164
# 6: 0.430367103
Well, we might want to do more with that list of linear models before tidying it, so this is a more general workflow. We might also be interested in additional statistics like the R-squared or the F-statistic:
DT_sum <- lm_summary_list %>%
  get_elem("coef|r.sq|fstat", regex = TRUE) %>%
  unlist2d(idcols = .c(Region, Income, Statistic),
           row.names = "Coef",
           DT = TRUE)
head(DT_sum)
# Region Income Statistic Coef Estimate Std. Error
# 1: East Asia & Pacific High income coefficients (Intercept) 0.5313479 0.7058550
# 2: East Asia & Pacific High income coefficients G(LIFEEX) 2.4935584 0.7586943
# 3: East Asia & Pacific High income coefficients B(G(LIFEEX), country) 3.8297123 1.6916770
# 4: East Asia & Pacific High income r.squared <NA> NA NA
# 5: East Asia & Pacific High income adj.r.squared <NA> NA NA
# 6: East Asia & Pacific High income fstatistic <NA> NA NA
# t value Pr(>|t|) V1 value numdf dendf
# 1: 0.752772 0.451991327 NA NA NA NA
# 2: 3.286644 0.001095466 NA NA NA NA
# 3: 2.263855 0.024071386 NA NA NA NA
# 4: NA NA 0.05245359 NA NA NA
# 5: NA NA 0.04812690 NA NA NA
# 6: NA NA NA 12.12325 2 438
# Reshaping to long form:
DT_sum %>%
  melt(1:4, na.rm = TRUE) %>%
  roworderv(1:2) %>% head(20)
# Region Income Statistic Coef variable
# 1: East Asia & Pacific High income coefficients (Intercept) Estimate
# 2: East Asia & Pacific High income coefficients G(LIFEEX) Estimate
# 3: East Asia & Pacific High income coefficients B(G(LIFEEX), country) Estimate
# 4: East Asia & Pacific High income coefficients (Intercept) Std. Error
# 5: East Asia & Pacific High income coefficients G(LIFEEX) Std. Error
# 6: East Asia & Pacific High income coefficients B(G(LIFEEX), country) Std. Error
# 7: East Asia & Pacific High income coefficients (Intercept) t value
# 8: East Asia & Pacific High income coefficients G(LIFEEX) t value
# 9: East Asia & Pacific High income coefficients B(G(LIFEEX), country) t value
# 10: East Asia & Pacific High income coefficients (Intercept) Pr(>|t|)
# 11: East Asia & Pacific High income coefficients G(LIFEEX) Pr(>|t|)
# 12: East Asia & Pacific High income coefficients B(G(LIFEEX), country) Pr(>|t|)
# 13: East Asia & Pacific High income r.squared <NA> V1
# 14: East Asia & Pacific High income adj.r.squared <NA> V1
# 15: East Asia & Pacific High income fstatistic <NA> value
# 16: East Asia & Pacific High income fstatistic <NA> numdf
# 17: East Asia & Pacific High income fstatistic <NA> dendf
# 18: East Asia & Pacific Lower middle income coefficients (Intercept) Estimate
# 19: East Asia & Pacific Lower middle income coefficients G(LIFEEX) Estimate
# 20: East Asia & Pacific Lower middle income coefficients B(G(LIFEEX), country) Estimate
# value
# 1: 5.313479e-01
# 2: 2.493558e+00
# 3: 3.829712e+00
# 4: 7.058550e-01
# 5: 7.586943e-01
# 6: 1.691677e+00
# 7: 7.527720e-01
# 8: 3.286644e+00
# 9: 2.263855e+00
# 10: 4.519913e-01
# 11: 1.095466e-03
# 12: 2.407139e-02
# 13: 5.245359e-02
# 14: 4.812690e-02
# 15: 1.212325e+01
# 16: 2.000000e+00
# 17: 4.380000e+02
# 18: 1.347660e+00
# 19: 5.238856e-01
# 20: 9.494439e-01
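The reshaping step relies on `na.rm = TRUE` to drop the many NA cells that arise when differently shaped statistics are bound into one wide table. A minimal sketch with a hypothetical two-row table (the table `wide` and its values are invented for illustration):

```r
library(data.table)

# Hypothetical wide table: each row fills only one of the measure columns
wide <- data.table(id       = c("a", "b"),
                   stat     = c("coef", "r2"),
                   Estimate = c(1.5, NA),
                   V1       = c(NA, 0.3))

# Columns 1:2 are the id variables; all other columns are stacked into
# 'variable' / 'value', and rows with a missing value are dropped
long <- melt(wide, id.vars = 1:2, na.rm = TRUE)
long
```

Without `na.rm = TRUE` the result would have one row per id row and measure column, including the NA cells.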
As a final example of this kind, let's suppose we are interested in the within-country correlations of all these variables by region and income group:
DT[, qDT(pwcor(W(.SD, country)), "Variable"),
   keyby = .(region, income), .SDcols = PCGDP:ODA] %>% head
# region income Variable W.PCGDP W.LIFEEX W.GINI W.ODA
# 1: East Asia & Pacific High income W.PCGDP 1.0000000 0.7562668 0.6253844 -0.25258496
# 2: East Asia & Pacific High income W.LIFEEX 0.7562668 1.0000000 0.3191255 -0.33611662
# 3: East Asia & Pacific High income W.GINI 0.6253844 0.3191255 1.0000000 NA
# 4: East Asia & Pacific High income W.ODA -0.2525850 -0.3361166 NA 1.00000000
# 5: East Asia & Pacific Lower middle income W.PCGDP 1.0000000 0.4685618 0.4428879 -0.02508852
# 6: East Asia & Pacific Lower middle income W.LIFEEX 0.4685618 1.0000000 0.3231520 0.09356733
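The "within" part of this computation comes from W(), which subtracts group means before the correlations are taken. A small sketch on a hypothetical toy panel (the data frame `d` and its columns are assumptions made for this example) checks this against a manual base R computation:

```r
library(collapse)

set.seed(1)
# Hypothetical toy panel: 3 groups with 5 observations each
d <- data.frame(id = rep(c("A", "B", "C"), each = 5),
                x  = rnorm(15),
                y  = rnorm(15))

# W() centers every column on its group mean (within transformation)
Wd <- W(d[c("x", "y")], d$id)

# The same transformation for x, written out in base R
manual_x <- d$x - ave(d$x, d$id)
all.equal(as.numeric(Wd[[1]]), manual_x)

# Pairwise correlations of the centered columns = within-group correlations
pwcor(Wd)
```

In the data.table call above the same idea is applied to `.SD` inside each region-income group, with `country` as the grouping variable for the centering.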
In summary: The list processing features, statistical capabilities and efficient converters of collapse and the flexibility of data.table work well together, facilitating more complex workflows.
These are all run on a 2-core laptop, so I honestly don’t know how collapse scales on powerful multi-core machines. My own limited computational resources are part of the reason I did not opt for a thread-parallel package from the start. But a multi-core version of collapse will eventually be released, perhaps by the end of 2021.
Mundlak, Yair. 1978. “On the Pooling of Time Series and Cross Section Data.” Econometrica 46 (1): 69–85.
Grouping on numeric variables in collapse is always ordered.↩︎
Apart from collapse::BY, which is only an auxiliary function written in base R to perform flexible split-apply-combine computing on vectors, matrices and data frames.↩︎
unlist2d can similarly bind nested lists of arrays, data frames or data.tables.↩︎