collapse and dplyr
Fast (Weighted) Aggregations and Transformations in a Piped Workflow
Sebastian Krantz
2021-01-04
This vignette focuses on the integration of collapse and the popular dplyr package by Hadley Wickham. In particular it will demonstrate how using collapse’s fast functions and some fast alternatives for dplyr verbs can substantially facilitate and speed up basic data manipulation, grouped and weighted aggregations and transformations, and panel data computations (i.e. between- and within-transformations, panel-lags, differences and growth rates) in a dplyr (piped) workflow.
Notes:
This vignette is targeted at dplyr / tidyverse users. collapse is a standalone package and can be programmed efficiently without pipes or dplyr verbs.
The ‘Introduction to collapse’ vignette provides a thorough introduction to the package, and a built-in structured documentation is available under help("collapse-documentation") after installing the package. In addition, help("collapse-package") provides a compact set of examples for a quick start. Documentation and vignettes can also be viewed online.
1. Fast Aggregations
A key feature of collapse is its broad set of Fast Statistical Functions (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fnobs, fndistinct), which are able to substantially speed up column-wise, grouped and weighted computations on vectors, matrices or data frames. The functions are S3 generic, with a default (vector), matrix and data frame method, as well as a grouped_df method for grouped tibbles used by dplyr. The grouped tibble method has the following arguments:

FUN.grouped_df(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
               use.g.names = FALSE, keep.group_vars = TRUE, [keep.w = TRUE,] ...)

where w is a weight variable, and TRA can be used to transform x using the computed statistics and one of 10 available transformations ("replace_fill", "replace", "-", "-+", "/", "%", "+", "*", "%%", "-%%", discussed in section 2). na.rm efficiently removes missing values and is TRUE by default. use.g.names generates new row-names from the unique combinations of groups (default: disabled), whereas keep.group_vars (default: enabled) keeps the grouping columns, as is custom in the native data %>% group_by(...) %>% summarize(...) workflow in dplyr. Finally, keep.w regulates whether a weighting variable used is also aggregated and saved in a column. For fsum, fmean, fmedian, fnth, fvar, fsd and fmode this will compute the sum of the weights in each group, whereas fprod returns the product of the weights.
With that in mind, let’s consider some straightforward applications.
1.1 Simple Aggregations
Consider the Groningen Growth and Development Center 10-Sector Database included in collapse and introduced in the main vignette:
library(collapse)
head(GGDC10S)
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 1 BWA SSA Sub-saharan Africa VA 1960 NA NA NA NA
# 2 BWA SSA Sub-saharan Africa VA 1961 NA NA NA NA
# 3 BWA SSA Sub-saharan Africa VA 1962 NA NA NA NA
# 4 BWA SSA Sub-saharan Africa VA 1963 NA NA NA NA
# 5 BWA SSA Sub-saharan Africa VA 1964 16.30154 3.494075 0.7365696 0.1043936
# 6 BWA SSA Sub-saharan Africa VA 1965 15.72700 2.495768 1.0181992 0.1350976
# CON WRT TRA FIRE GOV OTH SUM
# 1 NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA
# 5 0.6600454 6.243732 1.658928 1.119194 4.822485 2.341328 37.48229
# 6 1.3462312 7.064825 1.939007 1.246789 5.695848 2.678338 39.34710
# Summarize the Data:
# descr(GGDC10S, cols = is_categorical)
# aperm(qsu(GGDC10S, ~Variable, cols = is.numeric))
# Efficiently converting to tibble (no deep copy)
GGDC10S <- qTBL(GGDC10S)
Simple column-wise computations using the fast functions and pipe operators are performed as follows:
library(dplyr)
GGDC10S %>% fnobs # Number of Observations
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 5027 5027 5027 5027 5027 4364 4355 4355 4354
# CON WRT TRA FIRE GOV OTH SUM
# 4355 4355 4355 4355 3482 4248 4364
GGDC10S %>% fndistinct # Number of distinct values
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 43 6 6 2 67 4353 4224 4353 4237
# CON WRT TRA FIRE GOV OTH SUM
# 4339 4344 4334 4349 3470 4238 4364
GGDC10S %>% select_at(6:16) %>% fmedian # Median
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# 4394.5194 173.2234 3718.0981 167.9500 1473.4470 3773.6430 1174.8000 960.1251 3928.5127
# OTH SUM
# 1433.1722 23186.1936
GGDC10S %>% select_at(6:16) %>% fmean # Mean
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# 2526696.5 1867908.9 5538491.4 335679.5 1801597.6 3392909.5 1473269.7 1657114.8 1712300.3
# OTH SUM
# 1684527.3 21566436.8
GGDC10S %>% fmode # Mode
# Country Regioncode Region Variable Year
# "USA" "ASI" "Asia" "EMP" "2010"
# AGR MIN MAN PU CON
# "171.315882316326" "0" "4645.12507642586" "0" "1.34623115930777"
# WRT TRA FIRE GOV OTH
# "21.8380052682527" "8.97743416914571" "40.0701608636442" "0" "3626.84423577048"
# SUM
# "37.4822945751317"
GGDC10S %>% fmode(drop = FALSE) # Keep data structure intact
# # A tibble: 1 × 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# * <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 USA ASI Asia EMP 2010 171. 0 4645. 0 1.35 21.8 8.98 40.1 0
# # ℹ 2 more variables: OTH <dbl>, SUM <dbl>
Moving on to grouped statistics, we can compute the average value added and employment by sector and country using:
GGDC10S %>%
group_by(Variable, Country) %>%
select_at(6:16) %>% fmean
# # A tibble: 85 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 1.02e2 7.42e2 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4
# 2 EMP BOL 964. 56.0 235. 5.35e0 1.23e2 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3
# 3 EMP BRA 17191. 206. 6991. 3.65e2 3.52e3 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4
# 4 EMP BWA 188. 10.5 18.1 3.09e0 2.53e1 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2
# 5 EMP CHL 702. 101. 625. 2.94e1 2.96e2 6.95e2 2.58e2 272. NA 1.00e3 3.98e3
# 6 EMP CHN 287744. 7050. 67144. 1.61e3 2.09e4 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5
# 7 EMP COL 3091. 145. 1175. 3.39e1 5.24e2 2.07e3 4.70e2 649. NA 1.73e3 9.89e3
# 8 EMP CRI 231. 1.70 136. 1.43e1 5.76e1 1.57e2 4.24e1 54.9 128. 6.51e1 8.87e2
# 9 EMP DEW 2490. 407. 8473. 2.26e2 2.09e3 4.44e3 1.48e3 1689. 3945. 9.99e2 2.62e4
# 10 EMP DNK 236. 8.03 507. 1.38e1 1.71e2 4.55e2 1.61e2 181. 549. 1.11e2 2.39e3
# # ℹ 75 more rows
Similarly we can aggregate using any of the other functions above. It is important not to use dplyr’s summarize together with these functions, since that would eliminate their speed gain. These functions are fast because they are executed only once and carry out the grouped computations in C++, whereas summarize will apply the function to each group in the grouped tibble.
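As a minimal illustration of the difference in idiom (not a formal benchmark), both calls below produce the same grouped means, but only the first dispatches fmean a single time on the whole grouped tibble; the second calls it once per group and column:

# One dispatch, grouped computation inside fmean's compiled code:
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmean
# One call of fmean per group and column via summarise_all (avoid this):
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>%
  summarise_all(fmean)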
Excursus: What is Happening Behind the Scenes?
To better explain this point it is perhaps good to shed some light on what is happening behind the scenes of dplyr and collapse. Fundamentally both packages follow different computing paradigms:
dplyr is an efficient implementation of the Split-Apply-Combine computing paradigm. Data is split into groups, these data-chunks are then passed to a function carrying out the computation, and finally recombined to produce the aggregated data.frame. This modus operandi is evident in the grouping mechanism of dplyr. When a data.frame is passed through group_by, a ‘groups’ attribute is attached:
GGDC10S %>% group_by(Variable, Country) %>% attr("groups")
# # A tibble: 85 × 3
# Variable Country .rows
# <chr> <chr> <list<int>>
# 1 EMP ARG [62]
# 2 EMP BOL [61]
# 3 EMP BRA [62]
# 4 EMP BWA [52]
# 5 EMP CHL [63]
# 6 EMP CHN [62]
# 7 EMP COL [61]
# 8 EMP CRI [62]
# 9 EMP DEW [61]
# 10 EMP DNK [64]
# # ℹ 75 more rows
This object is a data.frame giving the unique groups and, in the third (last) column, vectors containing the indices of the rows belonging to each group. A command like summarize uses this information to split the data.frame into groups, which are then passed sequentially to the function used and later recombined. These steps are also done in C++, which makes dplyr quite efficient.
Now collapse is based around one-pass grouped computations at the C++ level using its own grouped statistical functions. In other words, the data is not split and recombined at all; rather, the entire computation is performed in a single C++ loop running through the data and completing the computations for all groups simultaneously. This modus operandi is also evident in collapse grouping objects. The method GRP.grouped_df takes a dplyr grouping object from a grouped tibble and efficiently converts it to a collapse grouping object:
GGDC10S %>% group_by(Variable, Country) %>% GRP %>% str
# Class 'GRP' hidden list of 9
# $ N.groups : int 85
# $ group.id : int [1:5027] 46 46 46 46 46 46 46 46 46 46 ...
# $ group.sizes : int [1:85] 62 61 62 52 63 62 61 62 61 64 ...
# $ groups :List of 2
# ..$ Variable: chr [1:85] "EMP" "EMP" "EMP" "EMP" ...
# .. ..- attr(*, "label")= chr "Variable"
# .. ..- attr(*, "format.stata")= chr "%9s"
# ..$ Country : chr [1:85] "ARG" "BOL" "BRA" "BWA" ...
# .. ..- attr(*, "label")= chr "Country"
# .. ..- attr(*, "format.stata")= chr "%9s"
# $ group.vars : chr [1:2] "Variable" "Country"
# $ ordered : Named logi [1:2] TRUE FALSE
# ..- attr(*, "names")= chr [1:2] "ordered" "sorted"
# $ order : NULL
# $ group.starts: NULL
# $ call : language GRP.grouped_df(X = .)
This object is a list where the first three elements give the number of groups, the group-id to which each row belongs, and a vector of group sizes. A function like fsum uses this information to (for each column) create a result vector of size ‘N.groups’ and then run through the column, using the ‘group.id’ vector to add the i'th data point to the group.id[i]'th element of the result vector. When the loop is finished, the grouped computation is also finished.
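As a rough sketch, the pure-R loop below mimics the single C++ pass that fsum performs (the real implementation additionally handles missing values, weights and multiple columns, and is of course vastly faster):

# Pure-R sketch of the one-pass grouped sum performed in C++ by fsum
one_pass_group_sum <- function(x, group.id, N.groups) {
  out <- numeric(N.groups)      # one result slot per group
  for (i in seq_along(x)) {     # single pass through the data
    gi <- group.id[i]
    out[gi] <- out[gi] + x[i]   # add i'th point to its group's slot
  }
  out
}
g <- GGDC10S %>% group_by(Variable, Country) %>% GRP
head(one_pass_group_sum(GGDC10S$SUM, g$group.id, g$N.groups))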
It is obvious that collapse is faster than dplyr, since its method of computing involves fewer steps and it does not need to call statistical functions multiple times. See the benchmark section.
1.2 More Speed using collapse Verbs
collapse fast functions do not develop their maximal performance on a grouped tibble created with group_by because of the additional cost of converting the grouping object incurred by GRP.grouped_df. This cost is already minimized through the use of C++, but we can do even better by replacing group_by with collapse::fgroup_by. fgroup_by works like group_by but does the grouping with collapse::GRP (up to 10x faster than group_by) and simply attaches a collapse grouping object to the grouped_df. Thus the speed gain is 2-fold: faster grouping and no conversion cost when calling collapse functions.

Another improvement comes from replacing the dplyr verb select with collapse::fselect, and, for selection using column names, indices or functions, using collapse::get_vars instead of select_at or select_if. Next to get_vars, collapse also introduces the predicates num_vars, cat_vars, char_vars, fact_vars, logi_vars and date_vars to efficiently select columns by type.
GGDC10S %>% fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fmedian
# # A tibble: 85 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1325. 47.4 1988. 1.05e2 7.82e2 1.85e3 5.80e2 464. 1739. 866. 9.74e3
# 2 EMP BOL 943. 53.5 167. 4.46e0 6.60e1 1.32e2 9.70e1 15.3 NA 384. 1.84e3
# 3 EMP BRA 17481. 225. 7208. 3.76e2 4.05e3 6.45e3 1.58e3 4355. 4450. 4479. 5.19e4
# 4 EMP BWA 175. 12.2 13.1 3.71e0 1.90e1 2.11e1 6.75e0 10.4 53.8 31.2 3.61e2
# 5 EMP CHL 690. 93.9 607. 2.58e1 2.30e2 4.84e2 2.05e2 106. NA 900. 3.31e3
# 6 EMP CHN 293915 8150. 61761. 1.14e3 1.06e4 1.70e4 9.56e3 4328. 19468. 9954. 4.45e5
# 7 EMP COL 3006. 84.0 1033. 3.71e1 4.19e2 1.55e3 3.91e2 655. NA 1430. 8.63e3
# 8 EMP CRI 216. 1.49 114. 7.92e0 5.50e1 8.98e1 2.55e1 19.6 122. 60.6 7.19e2
# 9 EMP DEW 2178 320. 8459. 2.47e2 2.10e3 4.45e3 1.53e3 1656 3700 900 2.65e4
# 10 EMP DNK 187. 3.75 508. 1.36e1 1.65e2 4.61e2 1.61e2 169. 642. 104. 2.42e3
# # ℹ 75 more rows
library(microbenchmark)
microbenchmark(collapse = GGDC10S %>% fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fmedian,
               hybrid = GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmedian,
               dplyr = GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% summarise_all(median, na.rm = TRUE))
# Unit: microseconds
# expr min lq mean median uq max neval
# collapse 334.136 411.4805 474.4216 458.617 530.3755 845.998 100
# hybrid 6234.412 7188.2745 8465.1353 8196.616 8819.1585 24021.236 100
# dplyr 38419.602 44488.8350 52952.8046 49954.128 57241.9500 130559.935 100
Benchmarks on the different components of this code and with larger data are provided under ‘Benchmarks’. Note that a grouped tibble created with fgroup_by can no longer be used for grouped computations with dplyr verbs like mutate or summarize. fgroup_by first assigns the class GRP_df, which is used for printing grouping information and for subsetting, then the object's other classes (tbl_df, data.table or whatever else), followed by the classes grouped_df and data.frame, and adds the grouping object in a ‘groups’ attribute. Since tbl_df is assigned before grouped_df, the object is treated by the dplyr ecosystem like a normal tibble.
class(group_by(GGDC10S, Variable, Country))
# [1] "grouped_df" "tbl_df" "tbl" "data.frame"
class(fgroup_by(GGDC10S, Variable, Country))
# [1] "GRP_df" "tbl_df" "tbl" "grouped_df" "data.frame"
The function fungroup removes the classes ‘GRP_df’ and ‘grouped_df’ and the ‘groups’ attribute (and can thus also be used for grouped tibbles created with dplyr::group_by).
Note that any kind of data frame based class can be grouped with fgroup_by, and still retain full responsiveness to all methods defined for that class. Functions performing aggregation on the grouped data frame remove the grouping object and classes afterwards, yielding an object with the same class and attributes as the input.
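For example, a small sketch: converting to a data.table with qDT, grouping with fgroup_by and aggregating yields a data.table again, so the class survives the round trip (output omitted):

GGDC10S_dt <- qDT(GGDC10S)   # quick conversion to data.table (no deep copy)
GGDC10S_dt %>% fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fmean %>% class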
The print method shown below reports the grouping variables, and then in square brackets the information [number of groups | average group size (standard-deviation of group sizes) min-max group size]:
fgroup_by(GGDC10S, Variable, Country)
# # A tibble: 5,027 × 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA SSA Sub-s… VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA SSA Sub-s… VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA SSA Sub-s… VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA SSA Sub-s… VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA SSA Sub-s… VA 1964 16.3 3.49 0.737 0.104 0.660 6.24 1.66 1.12 4.82
# 6 BWA SSA Sub-s… VA 1965 15.7 2.50 1.02 0.135 1.35 7.06 1.94 1.25 5.70
# 7 BWA SSA Sub-s… VA 1966 17.7 1.97 0.804 0.203 1.35 8.27 2.15 1.36 6.37
# 8 BWA SSA Sub-s… VA 1967 19.1 2.30 0.938 0.203 0.897 4.31 1.72 1.54 7.04
# 9 BWA SSA Sub-s… VA 1968 21.1 1.84 0.750 0.203 1.22 5.17 2.44 1.03 5.03
# 10 BWA SSA Sub-s… VA 1969 21.9 5.24 2.14 0.578 3.47 5.75 2.72 1.23 5.59
# # ℹ 5,017 more rows
# # ℹ 2 more variables: OTH <dbl>, SUM <dbl>
#
# Grouped by: Variable, Country [85 | 59 (7.7) 4-65]
Note further that fselect and get_vars are not full drop-in replacements for select, because they do not have a grouped_df method:
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% tail(3)
# # A tibble: 3 × 13
# # Groups: Variable, Country [1]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP EGY 5206. 29.0 2436. 307. 2733. 2977. 1992. 801. 5539. NA 22020.
# 2 EMP EGY 5186. 27.6 2374. 318. 2795. 3020. 2048. 815. 5636. NA 22219.
# 3 EMP EGY 5161. 24.8 2348. 325. 2931. 3110. 2065. 832. 5736. NA 22533.
GGDC10S %>% group_by(Variable, Country) %>% get_vars(6:16) %>% tail(3)
# # A tibble: 3 × 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 5206. 29.0 2436. 307. 2733. 2977. 1992. 801. 5539. NA 22020.
# 2 5186. 27.6 2374. 318. 2795. 3020. 2048. 815. 5636. NA 22219.
# 3 5161. 24.8 2348. 325. 2931. 3110. 2065. 832. 5736. NA 22533.
Since by default keep.group_vars = TRUE in the Fast Statistical Functions, the end result is nevertheless the same:
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmean %>% tail(3)
# # A tibble: 3 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA VEN 6860. 35478. 1.96e4 1.06e3 1.17e4 1.93e4 8.03e3 5.60e3 NA 19986. 1.28e5
# 2 VA ZAF 16419. 42928. 8.76e4 1.38e4 1.64e4 6.83e4 4.53e4 6.64e4 7.58e4 30167. 4.63e5
# 3 VA ZMB 1268849. 1006099. 9.00e5 2.19e5 8.66e5 2.10e6 7.05e5 9.10e5 1.10e6 81871. 9.16e6
GGDC10S %>% group_by(Variable, Country) %>% get_vars(6:16) %>% fmean %>% tail(3)
# # A tibble: 3 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA VEN 6860. 35478. 1.96e4 1.06e3 1.17e4 1.93e4 8.03e3 5.60e3 NA 19986. 1.28e5
# 2 VA ZAF 16419. 42928. 8.76e4 1.38e4 1.64e4 6.83e4 4.53e4 6.64e4 7.58e4 30167. 4.63e5
# 3 VA ZMB 1268849. 1006099. 9.00e5 2.19e5 8.66e5 2.10e6 7.05e5 9.10e5 1.10e6 81871. 9.16e6
Another useful verb introduced by collapse is fgroup_vars, which can be used to efficiently obtain the grouping columns or grouping variables from a grouped tibble:
# fgroup_vars fully supports grouped tibbles created with group_by or fgroup_by:
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars %>% head(3)
# # A tibble: 3 × 2
# Variable Country
# <chr> <chr>
# 1 VA BWA
# 2 VA BWA
# 3 VA BWA
GGDC10S %>% fgroup_by(Variable, Country) %>% fgroup_vars %>% head(3)
# # A tibble: 3 × 2
# Variable Country
# <chr> <chr>
# 1 VA BWA
# 2 VA BWA
# 3 VA BWA
# The other possibilities:
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("unique") %>% head(3)
# # A tibble: 3 × 2
# Variable Country
# <chr> <chr>
# 1 EMP ARG
# 2 EMP BOL
# 3 EMP BRA
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("names")
# [1] "Variable" "Country"
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("indices")
# [1] 4 1
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("named_indices")
# Variable Country
# 4 1
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("logical")
# [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("named_logical")
# Country Regioncode Region Variable Year AGR MIN MAN PU
# TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# CON WRT TRA FIRE GOV OTH SUM
# FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Another collapse verb to mention here is fsubset, a faster alternative to dplyr::filter, which also provides an option to flexibly select columns after the subsetting condition:
# Two equivalent calls, the first is substantially faster
GGDC10S %>% fsubset(Variable == "VA" & Year > 1990, Country, Year, AGR:GOV) %>% head(3)
# # A tibble: 3 × 11
# Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA 1991 303. 2647. 473. 161. 580. 807. 233. 433. 1073.
# 2 BWA 1992 333. 2691. 537. 178. 679. 725. 285. 517. 1234.
# 3 BWA 1993 405. 2625. 567. 219. 634. 772. 350. 673. 1487.
GGDC10S %>% filter(Variable == "VA" & Year > 1990) %>% select(Country, Year, AGR:GOV) %>% head(3)
# # A tibble: 3 × 11
# Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA 1991 303. 2647. 473. 161. 580. 807. 233. 433. 1073.
# 2 BWA 1992 333. 2691. 537. 178. 679. 725. 285. 517. 1234.
# 3 BWA 1993 405. 2625. 567. 219. 634. 772. 350. 673. 1487.
collapse also offers roworder, frename, colorder and ftransform/TRA as fast replacements for dplyr::arrange, dplyr::rename, dplyr::relocate and dplyr::mutate.
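A small sketch of two of these replacements (the sorting and column order chosen here are purely illustrative):

GGDC10S %>%
  fsubset(Variable == "VA", Country, Year, AGR, SUM) %>%
  roworder(Country, -Year) %>%   # fast alternative to dplyr::arrange ('-' sorts descending)
  colorder(Year, Country) %>%    # fast alternative to dplyr::relocate (Year moved first)
  head(3)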
1.3 Multi-Function Aggregations
One can also aggregate with multiple functions at the same time. For such operations it is often necessary to use curly braces { to prevent first-argument injection, so that %>% cbind(FUN1(.), FUN2(.)) does not evaluate as %>% cbind(., FUN1(.), FUN2(.)):
GGDC10S %>%
fgroup_by(Variable, Country) %>%
get_vars(6:16) %>% {
cbind(fmedian(.),
add_stub(fmean(., keep.group_vars = FALSE), "mean_"))
} %>% head(3)
# Variable Country AGR MIN MAN PU CON WRT TRA
# 1 EMP ARG 1324.5255 47.35255 1987.5912 104.738825 782.40283 1854.612 579.93982
# 2 EMP BOL 943.1612 53.53538 167.1502 4.457895 65.97904 132.225 96.96828
# 3 EMP BRA 17480.9810 225.43693 7207.7915 375.851832 4054.66103 6454.523 1580.81120
# FIRE GOV OTH SUM mean_AGR mean_MIN mean_MAN mean_PU mean_CON
# 1 464.39920 1738.836 866.1119 9743.223 1419.8013 52.08903 1931.7602 101.720936 742.4044
# 2 15.34259 NA 384.0678 1842.055 964.2103 56.03295 235.0332 5.346433 122.7827
# 3 4354.86210 4449.942 4478.6927 51881.110 17191.3529 206.02389 6991.3710 364.573404 3524.7384
# mean_WRT mean_TRA mean_FIRE mean_GOV mean_OTH mean_SUM
# 1 1982.1775 648.5119 627.79291 2043.471 992.4475 10542.177
# 2 281.5164 115.4728 44.56442 NA 395.5650 2220.524
# 3 8509.4612 2054.3731 4413.54448 5307.280 5710.2665 54272.985
The function add_stub used above is a collapse function adding a prefix (default) or suffix to variable names. The collapse function add_vars provides a more efficient alternative to cbind.data.frame. The idea here is ‘adding’ variables to the data.frame in the first argument, i.e. the attributes of the first argument are preserved, so the expression below still gives a tibble instead of a data.frame:
GGDC10S %>%
fgroup_by(Variable, Country) %>% {
add_vars(get_vars(., "Reg", regex = TRUE) %>% ffirst, # Regular expression matching column names
num_vars(.) %>% fmean(keep.group_vars = FALSE) %>% add_stub("mean_"), # num_vars selects all numeric variables
fselect(., PU:TRA) %>% fmedian(keep.group_vars = FALSE) %>% add_stub("median_"),
fselect(., PU:CON) %>% fmin(keep.group_vars = FALSE) %>% add_stub("min_"))
} %>% head(3)
# # A tibble: 3 × 22
# Variable Country Regioncode Region mean_Year mean_AGR mean_MIN mean_MAN mean_PU mean_CON mean_WRT
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin … 1980. 1420. 52.1 1932. 102. 742. 1982.
# 2 EMP BOL LAM Latin … 1980 964. 56.0 235. 5.35 123. 282.
# 3 EMP BRA LAM Latin … 1980. 17191. 206. 6991. 365. 3525. 8509.
# # ℹ 11 more variables: mean_TRA <dbl>, mean_FIRE <dbl>, mean_GOV <dbl>, mean_OTH <dbl>,
# # mean_SUM <dbl>, median_PU <dbl>, median_CON <dbl>, median_WRT <dbl>, median_TRA <dbl>,
# # min_PU <dbl>, min_CON <dbl>
Another nice feature of add_vars is that it can also very efficiently reorder columns, i.e. bind columns in a different order than they are passed. This can be done by simply specifying the positions the added columns should have in the final data frame; add_vars then shifts the columns of the first argument to the right to fill in the gaps.
GGDC10S %>%
fsubset(Variable == "VA", Country, AGR, SUM) %>%
fgroup_by(Country) %>% {
add_vars(fgroup_vars(.,"unique"),
fmean(., keep.group_vars = FALSE) %>% add_stub("mean_"),
fsd(., keep.group_vars = FALSE) %>% add_stub("sd_"),
pos = c(2,4,3,5))
} %>% head(3)
# # A tibble: 3 × 5
# Country mean_AGR sd_AGR mean_SUM sd_SUM
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 ARG 14951. 33061. 152534. 301316.
# 2 BOL 3300. 4456. 22619. 33173.
# 3 BRA 76870. 59442. 1200563. 976963.
A much more compact solution to multi-function and multi-type aggregation is offered by the function collapg:
# This aggregates numeric columns using the mean (fmean) and categorical columns with the mode (fmode)
GGDC10S %>% fgroup_by(Variable, Country) %>% collapg %>% head(3)
# # A tibble: 3 × 16
# Variable Country Regioncode Region Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin … 1980. 1420. 52.1 1932. 102. 742. 1982. 649. 628. 2043.
# 2 EMP BOL LAM Latin … 1980 964. 56.0 235. 5.35 123. 282. 115. 44.6 NA
# 3 EMP BRA LAM Latin … 1980. 17191. 206. 6991. 365. 3525. 8509. 2054. 4414. 5307.
# # ℹ 2 more variables: OTH <dbl>, SUM <dbl>
By default it aggregates numeric columns using fmean and categorical columns using fmode, and preserves the order of all columns. Changing these defaults is very easy:
# This aggregates numeric columns using the median and categorical columns using the last value
GGDC10S %>% fgroup_by(Variable, Country) %>% collapg(fmedian, flast) %>% head(3)
# # A tibble: 3 × 16
# Variable Country Regioncode Region Year AGR MIN MAN PU CON WRT TRA FIRE
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin Amer… 1980. 1325. 47.4 1988. 105. 782. 1855. 580. 464.
# 2 EMP BOL LAM Latin Amer… 1980 943. 53.5 167. 4.46 66.0 132. 97.0 15.3
# 3 EMP BRA LAM Latin Amer… 1980. 17481. 225. 7208. 376. 4055. 6455. 1581. 4355.
# # ℹ 3 more variables: GOV <dbl>, OTH <dbl>, SUM <dbl>
One can apply multiple functions to both numeric and/or categorical data:
GGDC10S %>% fgroup_by(Variable, Country) %>%
collapg(list(fmean, fmedian), list(first, fmode, flast)) %>% head(3)
# # A tibble: 3 × 32
# Variable Country first.Regioncode fmode.Regioncode flast.Regioncode first.Region fmode.Region
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 EMP ARG LAM LAM LAM Latin America Latin America
# 2 EMP BOL LAM LAM LAM Latin America Latin America
# 3 EMP BRA LAM LAM LAM Latin America Latin America
# # ℹ 25 more variables: flast.Region <chr>, fmean.Year <dbl>, fmedian.Year <dbl>, fmean.AGR <dbl>,
# # fmedian.AGR <dbl>, fmean.MIN <dbl>, fmedian.MIN <dbl>, fmean.MAN <dbl>, fmedian.MAN <dbl>,
# # fmean.PU <dbl>, fmedian.PU <dbl>, fmean.CON <dbl>, fmedian.CON <dbl>, fmean.WRT <dbl>,
# # fmedian.WRT <dbl>, fmean.TRA <dbl>, fmedian.TRA <dbl>, fmean.FIRE <dbl>, fmedian.FIRE <dbl>,
# # fmean.GOV <dbl>, fmedian.GOV <dbl>, fmean.OTH <dbl>, fmedian.OTH <dbl>, fmean.SUM <dbl>,
# # fmedian.SUM <dbl>
Applying multiple functions to only numeric (or only categorical) data also allows the result to be returned in a long format:
GGDC10S %>% fgroup_by(Variable, Country) %>%
collapg(list(fmean, fmedian), cols = is.numeric, return = "long") %>% head(3)
# # A tibble: 3 × 15
# Function Variable Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 fmean EMP ARG 1980. 1420. 52.1 1932. 102. 742. 1982. 649. 628. 2043. 992.
# 2 fmean EMP BOL 1980 964. 56.0 235. 5.35 123. 282. 115. 44.6 NA 396.
# 3 fmean EMP BRA 1980. 17191. 206. 6991. 365. 3525. 8509. 2054. 4414. 5307. 5710.
# # ℹ 1 more variable: SUM <dbl>
Finally, collapg also makes it very easy to apply aggregator functions to certain columns only:
GGDC10S %>% fgroup_by(Variable, Country) %>%
collapg(custom = list(fmean = 6:8, fmedian = 10:12)) %>% head(3)
# # A tibble: 3 × 8
# Variable Country AGR MIN MAN CON WRT TRA
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 782. 1855. 580.
# 2 EMP BOL 964. 56.0 235. 66.0 132. 97.0
# 3 EMP BRA 17191. 206. 6991. 4055. 6455. 1581.
To understand more about collapg, look it up in the documentation (?collapg).
1.4 Weighted Aggregations
Weighted aggregations are possible with the functions fsum, fprod, fmean, fmedian, fnth, fmode, fvar and fsd. The implementation is such that by default (option keep.w = TRUE) these functions also aggregate the weights, so that further weighted computations can be performed on the aggregated data. fprod saves the product of the weights, whereas the other functions save the sum of the weights in a column next to the grouping variables. If na.rm = TRUE (the default), rows with missing weights are omitted from the computation.
# This computes a frequency-weighted grouped standard-deviation, taking the total EMP / VA as weight
GGDC10S %>%
fgroup_by(Variable, Country) %>%
fselect(AGR:SUM) %>% fsd(SUM) %>% head(3)
# # A tibble: 3 × 13
# Variable Country sum.SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 653615. 225. 22.2 176. 20.5 285. 856. 195. 493. 1123. 506.
# 2 EMP BOL 135452. 99.7 17.1 168. 4.87 123. 324. 98.1 69.8 NA 258.
# 3 EMP BRA 3364925. 1587. 73.8 2952. 93.8 1861. 6285. 1306. 3003. 3621. 4257.
# This computes a weighted grouped mode, taking the total EMP / VA as weight
GGDC10S %>%
fgroup_by(Variable, Country) %>%
fselect(AGR:SUM) %>% fmode(SUM) %>% head(3)
# # A tibble: 3 × 13
# Variable Country sum.SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 653615. 1162. 127. 2164. 152. 1415. 3768. 1060. 1748. 4336. 1999.
# 2 EMP BOL 135452. 819. 37.6 604. 10.8 433. 893. 333. 321. NA 1057.
# 3 EMP BRA 3364925. 16451. 313. 11841. 388. 8154. 21860. 5169. 12011. 12149. 14235.
The weighted variance / standard deviation is currently only implemented with frequency weights.
Weighted aggregations may also be performed with collapg. By default fsum is used to compute the sum of the weights, but it is also possible to aggregate the weights with other functions:
# This aggregates numeric columns using the weighted mean (the default) and categorical columns using the weighted mode (the default).
# Weights (column SUM) are aggregated using both the sum and the maximum.
GGDC10S %>% group_by(Variable, Country) %>%
collapg(w = SUM, wFUN = list(fsum, fmax)) %>% head(3)
# # A tibble: 3 × 17
# Variable Country fsum.SUM fmax.SUM Regioncode Region Year AGR MIN MAN PU CON WRT
# <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 653615. 17929. LAM Latin … 1985. 1361. 56.5 1935. 105. 811. 2217.
# 2 EMP BOL 135452. 4508. LAM Latin … 1987. 977. 57.9 296. 7.07 167. 400.
# 3 EMP BRA 3364925. 102572. LAM Latin … 1989. 17746. 238. 8466. 389. 4436. 11376.
# # ℹ 4 more variables: TRA <dbl>, FIRE <dbl>, GOV <dbl>, OTH <dbl>
2. Fast Transformations
collapse also provides some fast transformations that significantly extend the scope and speed of manipulations that can be performed with dplyr::mutate.
2.1 Fast Transform and Compute Variables
The function ftransform can be used to manipulate columns in the same way as mutate:
GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR, SUM) %>%
ftransform(AGR_perc = AGR / SUM * 100, # Computing % of VA in Agriculture
AGR_mean = fmean(AGR), # Average Agricultural VA
AGR = NULL, SUM = NULL) %>% # Deleting columns AGR and SUM
head
# # A tibble: 6 × 4
# Country Year AGR_perc AGR_mean
# <chr> <dbl> <dbl> <dbl>
# 1 BWA 1960 NA 5137561.
# 2 BWA 1961 NA 5137561.
# 3 BWA 1962 NA 5137561.
# 4 BWA 1963 NA 5137561.
# 5 BWA 1964 43.5 5137561.
# 6 BWA 1965 40.0 5137561.
The variant ftransformv enables transformations of groups of columns, like dplyr::mutate_at and dplyr::mutate_if:
# This replaces variables mpg, carb and wt by their log (.c turns expressions into character vectors)
mtcars %>% ftransformv(.c(mpg, carb, wt), log) %>% head
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 3.044522 6 160 110 3.90 0.9631743 16.46 0 1 4 1.3862944
# Mazda RX4 Wag 3.044522 6 160 110 3.90 1.0560527 17.02 0 1 4 1.3862944
# Datsun 710 3.126761 4 108 93 3.85 0.8415672 18.61 1 1 4 0.0000000
# Hornet 4 Drive 3.063391 6 258 110 3.08 1.1678274 19.44 1 0 3 0.0000000
# Hornet Sportabout 2.928524 8 360 175 3.15 1.2354715 17.02 0 0 3 0.6931472
# Valiant 2.895912 6 225 105 2.76 1.2412686 20.22 1 0 3 0.0000000
# Logging numeric variables
iris %>% ftransformv(is.numeric, log) %>% head
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 1.629241 1.252763 0.3364722 -1.6094379 setosa
# 2 1.589235 1.098612 0.3364722 -1.6094379 setosa
# 3 1.547563 1.163151 0.2623643 -1.6094379 setosa
# 4 1.526056 1.131402 0.4054651 -1.6094379 setosa
# 5 1.609438 1.280934 0.3364722 -1.6094379 setosa
# 6 1.686399 1.360977 0.5306283 -0.9162907 setosa
Instead of column = value type arguments, it is also possible to pass a single list of transformed variables to ftransform, which will be regarded in the same way as an evaluated list of column = value arguments. It can be used for more complex transformations:
# Logging values and replacing generated Inf values
mtcars %>% ftransform(fselect(., mpg, cyl, vs:gear) %>% lapply(log) %>% replace_Inf) %>% head
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 3.044522 1.791759 160 110 3.90 2.620 16.46 NA 0 1.386294 4
# Mazda RX4 Wag 3.044522 1.791759 160 110 3.90 2.875 17.02 NA 0 1.386294 4
# Datsun 710 3.126761 1.386294 108 93 3.85 2.320 18.61 0 0 1.386294 1
# Hornet 4 Drive 3.063391 1.791759 258 110 3.08 3.215 19.44 0 NA 1.098612 1
# Hornet Sportabout 2.928524 2.079442 360 175 3.15 3.440 17.02 NA NA 1.098612 2
# Valiant 2.895912 1.791759 225 105 2.76 3.460 20.22 0 NA 1.098612 1
If only the computed columns need to be returned, fcompute provides an efficient alternative:
GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR, SUM) %>%
fcompute(AGR_perc = AGR / SUM * 100,
AGR_mean = fmean(AGR)) %>% head
# # A tibble: 6 × 2
# AGR_perc AGR_mean
# <dbl> <dbl>
# 1 NA 5137561.
# 2 NA 5137561.
# 3 NA 5137561.
# 4 NA 5137561.
# 5 43.5 5137561.
# 6 40.0 5137561.
ftransform and fcompute are an order of magnitude faster than mutate, but they do not support grouped computations using arbitrary functions. We will see that this is hardly a limitation, as collapse provides very efficient and elegant alternative programming mechanisms…
2.2 Replacing and Sweeping out Statistics
All statistical (scalar-valued) functions in the collapse package (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fnobs, fndistinct) have a TRA argument which can be used to efficiently transform data by either (column-wise) replacing data values with computed statistics or sweeping the statistics out of the data. Operations can be specified using either an integer or a quoted operator / string; the two forms are equivalent, as sketched after the list below. The 10 operations supported by TRA are:

1 - “replace_fill” : replace and overwrite missing values (same as mutate)
2 - “replace” : replace but preserve missing values
3 - “-” : subtract (center)
4 - “-+” : subtract group-statistics but add average of group statistics
5 - “/” : divide (scale)
6 - “%” : compute percentages (divide and multiply by 100)
7 - “+” : add
8 - “*” : multiply
9 - “%%” : modulus
10 - “-%%” : subtract modulus
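Integer codes follow the ordering above, so, as a quick sketch, centering a vector can be requested either way:

# TRA = 3L and TRA = "-" both request subtraction of the computed statistic:
identical(fmean(mtcars$mpg, TRA = "-"), fmean(mtcars$mpg, TRA = 3L))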
Simple transformations are again straightforward to specify:
# This subtracts the median value from all data points i.e. centers on the median
GGDC10S %>% num_vars %>% fmedian(TRA = "-") %>% head
# # A tibble: 6 × 12
# Year AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 -22 NA NA NA NA NA NA NA NA NA NA NA
# 2 -21 NA NA NA NA NA NA NA NA NA NA NA
# 3 -20 NA NA NA NA NA NA NA NA NA NA NA
# 4 -19 NA NA NA NA NA NA NA NA NA NA NA
# 5 -18 -4378. -170. -3717. -168. -1473. -3767. -1173. -959. -3924. -1431. -23149.
# 6 -17 -4379. -171. -3717. -168. -1472. -3767. -1173. -959. -3923. -1430. -23147.
# This replaces all data points with the mode
GGDC10S %>% char_vars %>% fmode(TRA = "replace") %>% head
# # A tibble: 6 × 4
# Country Regioncode Region Variable
# <chr> <chr> <chr> <chr>
# 1 USA ASI Asia EMP
# 2 USA ASI Asia EMP
# 3 USA ASI Asia EMP
# 4 USA ASI Asia EMP
# 5 USA ASI Asia EMP
# 6 USA ASI Asia EMP
Similarly for grouped transformations:
# Replacing data with the first quartile (25th percentile)
GGDC10S %>%
fselect(Variable, Country, AGR:SUM) %>%
fgroup_by(Variable, Country) %>% fnth(0.25, TRA = "replace_fill") %>% head(3)
# # A tibble: 3 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA 63.5 33.1 27.3 7.36 26.8 31.1 13.2 12.0 33.6 11.5 262.
# 2 VA BWA 63.5 33.1 27.3 7.36 26.8 31.1 13.2 12.0 33.6 11.5 262.
# 3 VA BWA 63.5 33.1 27.3 7.36 26.8 31.1 13.2 12.0 33.6 11.5 262.
# Scaling sectoral data by Variable and Country
GGDC10S %>%
fselect(Variable, Country, AGR:SUM) %>%
fgroup_by(Variable, Country) %>% fsd(TRA = "/") %>% head
# # A tibble: 6 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 0.0270 0.000556 0.000523 3.88e-4 5.11e-4 0.00194 0.00154 5.23e-4 0.00134
# 6 VA BWA 0.0260 0.000397 0.000723 5.03e-4 1.04e-3 0.00220 0.00180 5.83e-4 0.00158
# # ℹ 2 more variables: OTH <dbl>, SUM <dbl>
The benchmarks below will demonstrate that these internal sweeping and replacement operations, fully performed in C++, compute significantly faster than using dplyr::mutate, especially as the number of groups grows large. The S3 generic nature of the Fast Statistical Functions further allows us to perform grouped mutations on the fly (together with ftransform or fcompute), without first creating a grouped tibble:
# AGR_gmed = TRUE if AGR is greater than its median value, grouped by Variable and Country
# Note: This calls fmedian.default
settransform(GGDC10S, AGR_gmed = AGR > fmedian(AGR, list(Variable, Country), TRA = "replace"))
tail(GGDC10S, 3)
# # A tibble: 3 × 17
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EGY MENA Middle Ea… EMP 2010 5206. 29.0 2436. 307. 2733. 2977. 1992. 801. 5539.
# 2 EGY MENA Middle Ea… EMP 2011 5186. 27.6 2374. 318. 2795. 3020. 2048. 815. 5636.
# 3 EGY MENA Middle Ea… EMP 2012 5161. 24.8 2348. 325. 2931. 3110. 2065. 832. 5736.
# # ℹ 3 more variables: OTH <dbl>, SUM <dbl>, AGR_gmed <lgl>
# Dividing (scaling) the sectoral data (columns 6 through 16) by their grouped standard deviation
settransformv(GGDC10S, 6:16, fsd, list(Variable, Country), TRA = "/", apply = FALSE)
tail(GGDC10S, 3)
# # A tibble: 3 × 17
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EGY MENA Middle Ea… EMP 2010 8.41 2.28 4.32 3.56 3.62 3.75 3.75 3.14 3.80
# 2 EGY MENA Middle Ea… EMP 2011 8.38 2.17 4.21 3.68 3.70 3.81 3.86 3.19 3.86
# 3 EGY MENA Middle Ea… EMP 2012 8.34 1.95 4.17 3.76 3.88 3.92 3.89 3.26 3.93
# # ℹ 3 more variables: OTH <dbl>, SUM <dbl>, AGR_gmed <lgl>
rm(GGDC10S)
Weights are easily added to any grouped transformation:
# This subtracts weighted group means from the data, using the SUM column as weights
GGDC10S %>%
fselect(Variable, Country, AGR:SUM) %>%
fgroup_by(Variable, Country) %>% fmean(SUM, "-") %>% head
# # A tibble: 6 × 13
# Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 37.5 -1301. -13317. -2965. -529. -2746. -6540. -2157. -4431. -7551. -2613.
# 6 VA BWA 39.3 -1302. -13318. -2964. -529. -2745. -6540. -2156. -4431. -7550. -2613.
Sequential operations are also easily performed:
# This scales and then subtracts the median
GGDC10S %>%
fselect(Variable, Country, AGR:SUM) %>%
fgroup_by(Variable, Country) %>% fsd(TRA = "/") %>% fmedian(TRA = "-")
# # A tibble: 5,027 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA -0.182 -0.235 -0.183 -0.245 -0.118 -0.0820 -0.0724 -0.0661 -0.108 -0.0848 -0.146
# 6 VA BWA -0.183 -0.235 -0.183 -0.245 -0.117 -0.0817 -0.0722 -0.0660 -0.108 -0.0846 -0.146
# 7 VA BWA -0.180 -0.235 -0.183 -0.245 -0.117 -0.0813 -0.0720 -0.0659 -0.107 -0.0843 -0.145
# 8 VA BWA -0.177 -0.235 -0.183 -0.245 -0.117 -0.0826 -0.0724 -0.0659 -0.107 -0.0841 -0.146
# 9 VA BWA -0.174 -0.235 -0.183 -0.245 -0.117 -0.0823 -0.0717 -0.0661 -0.108 -0.0848 -0.146
# 10 VA BWA -0.173 -0.234 -0.182 -0.243 -0.115 -0.0821 -0.0715 -0.0660 -0.108 -0.0846 -0.145
# # ℹ 5,017 more rows
#
# Grouped by: Variable, Country [85 | 59 (7.7) 4-65]
Of course it is also possible to combine multiple functions as in the aggregation section, or to add variables to existing data:
# This adds a groupwise observation count next to each column
add_vars(GGDC10S, seq(7,27,2)) <- GGDC10S %>%
fgroup_by(Variable, Country) %>% fselect(AGR:SUM) %>%
fnobs("replace_fill") %>% add_stub("N_")
head(GGDC10S)
# # A tibble: 6 × 27
# Country Regioncode Region Variable Year AGR N_AGR MIN N_MIN MAN N_MAN PU N_PU CON
# <chr> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> <dbl>
# 1 BWA SSA Sub-sa… VA 1960 NA 47 NA 47 NA 47 NA 47 NA
# 2 BWA SSA Sub-sa… VA 1961 NA 47 NA 47 NA 47 NA 47 NA
# 3 BWA SSA Sub-sa… VA 1962 NA 47 NA 47 NA 47 NA 47 NA
# 4 BWA SSA Sub-sa… VA 1963 NA 47 NA 47 NA 47 NA 47 NA
# 5 BWA SSA Sub-sa… VA 1964 16.3 47 3.49 47 0.737 47 0.104 47 0.660
# 6 BWA SSA Sub-sa… VA 1965 15.7 47 2.50 47 1.02 47 0.135 47 1.35
# # ℹ 13 more variables: N_CON <int>, WRT <dbl>, N_WRT <int>, TRA <dbl>, N_TRA <int>, FIRE <dbl>,
# # N_FIRE <int>, GOV <dbl>, N_GOV <int>, OTH <dbl>, N_OTH <int>, SUM <dbl>, N_SUM <int>
rm(GGDC10S)
There are lots of other examples one could construct using the 10 operations and 14 functions listed above; the examples provided just outline the suggested programming basics. Performance considerations make it very much worthwhile to spend some time thinking about how complex operations can be implemented in this programming framework, before defining some function in R and applying it to data using dplyr::mutate.
2.3 More Control using the TRA Function
Towards this end, calling TRA() directly also facilitates more complex and customized operations. Behind the scenes of the TRA = ... argument, the Fast Statistical Functions first compute the grouped statistics on all columns of the data, and these statistics are then directly fed into a C++ function that uses them to replace or sweep them out of the data points in one of the 10 ways described above. This function can also be called directly by the name of TRA.

Fundamentally, TRA is a generalization of base::sweep for column-wise grouped operations. Direct calls to TRA enable more control over inputs and outputs.
The two operations below are equivalent, although the first is slightly more efficient as it only requires one method dispatch and one check of the inputs:
# This divides by the product
GGDC10S %>%
fgroup_by(Variable, Country) %>%
get_vars(6:16) %>% fprod(TRA = "/") %>% head
# # A tibble: 6 × 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA
# 5 1.29e-105 2.81e-127 1.40e-101 4.44e-74 4.19e-102 3.97e-113 6.91e-92 1.01e-97 2.51e-117
# 6 1.24e-105 2.00e-127 1.94e-101 5.75e-74 8.55e-102 4.49e-113 8.08e-92 1.13e-97 2.96e-117
# # ℹ 2 more variables: OTH <dbl>, SUM <dbl>
# Same thing
GGDC10S %>%
fgroup_by(Variable, Country) %>%
get_vars(6:16) %>%
TRA(fprod(., keep.group_vars = FALSE), "/") %>% head # [same as TRA(.,fprod(., keep.group_vars = FALSE),"/")]
# # A tibble: 6 × 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA
# 5 1.29e-105 2.81e-127 1.40e-101 4.44e-74 4.19e-102 3.97e-113 6.91e-92 1.01e-97 2.51e-117
# 6 1.24e-105 2.00e-127 1.94e-101 5.75e-74 8.55e-102 4.49e-113 8.08e-92 1.13e-97 2.96e-117
# # ℹ 2 more variables: OTH <dbl>, SUM <dbl>
TRA.grouped_df was designed such that it matches the columns of the statistics (aggregated columns) to those of the original data, and only transforms matching columns while returning the whole data frame. Thus it is easily possible to only apply a transformation to the first two sectors:
# This only demeans Agriculture (AGR) and Mining (MIN)
GGDC10S %>%
fgroup_by(Variable, Country) %>%
TRA(fselect(., AGR, MIN) %>% fmean(keep.group_vars = FALSE), "-") %>% head
# # A tibble: 6 × 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA SSA Sub-s… VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA SSA Sub-s… VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA SSA Sub-s… VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA SSA Sub-s… VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA SSA Sub-s… VA 1964 -446. -4505. 0.737 0.104 0.660 6.24 1.66 1.12 4.82
# 6 BWA SSA Sub-s… VA 1965 -446. -4506. 1.02 0.135 1.35 7.06 1.94 1.25 5.70
# # ℹ 2 more variables: OTH <dbl>, SUM <dbl>
Since TRA is already built into all Fast Statistical Functions as an argument, it is best used in computations where grouped statistics are computed using some other function.
# Same as above, with one line of code using fmean.data.frame and ftransform...
GGDC10S %>% ftransform(fmean(list(AGR = AGR, MIN = MIN), list(Variable, Country), TRA = "-")) %>% head
# # A tibble: 6 × 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA SSA Sub-s… VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA SSA Sub-s… VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA SSA Sub-s… VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA SSA Sub-s… VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA SSA Sub-s… VA 1964 -446. -4505. 0.737 0.104 0.660 6.24 1.66 1.12 4.82
# 6 BWA SSA Sub-s… VA 1965 -446. -4506. 1.02 0.135 1.35 7.06 1.94 1.25 5.70
# # ℹ 2 more variables: OTH <dbl>, SUM <dbl>
Another potential use of TRA is to do computations in two or more steps, for example if both aggregated and transformed data are needed, or if computations are more complex and involve other manipulations in between the aggregating and sweeping parts:
# Get grouped tibble
gGGDC <- GGDC10S %>% fgroup_by(Variable, Country)
# Get aggregated data
gsumGGDC <- gGGDC %>% fselect(AGR:SUM) %>% fsum
head(gsumGGDC)
# # A tibble: 6 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 88028. 3230. 1.20e5 6307. 4.60e4 1.23e5 4.02e4 3.89e4 1.27e5 6.15e4 6.54e5
# 2 EMP BOL 58817. 3418. 1.43e4 326. 7.49e3 1.72e4 7.04e3 2.72e3 NA 2.41e4 1.35e5
# 3 EMP BRA 1065864. 12773. 4.33e5 22604. 2.19e5 5.28e5 1.27e5 2.74e5 3.29e5 3.54e5 3.36e6
# 4 EMP BWA 8839. 493. 8.49e2 145. 1.19e3 1.71e3 3.93e2 7.21e2 2.87e3 1.30e3 1.85e4
# 5 EMP CHL 44220. 6389. 3.94e4 1850. 1.86e4 4.38e4 1.63e4 1.72e4 NA 6.32e4 2.51e5
# 6 EMP CHN 17264654. 422972. 4.03e6 96364. 1.25e6 1.73e6 8.36e5 2.96e5 1.36e6 1.86e6 2.91e7
# Get transformed (scaled) data
head(TRA(gGGDC, gsumGGDC, "/"))
# # A tibble: 6 × 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA SSA Sub-sahar… VA 1960 NA NA NA NA NA NA
# 2 BWA SSA Sub-sahar… VA 1961 NA NA NA NA NA NA
# 3 BWA SSA Sub-sahar… VA 1962 NA NA NA NA NA NA
# 4 BWA SSA Sub-sahar… VA 1963 NA NA NA NA NA NA
# 5 BWA SSA Sub-sahar… VA 1964 7.50e-4 1.65e-5 1.66e-5 1.03e-5 1.57e-5 6.82e-5
# 6 BWA SSA Sub-sahar… VA 1965 7.24e-4 1.18e-5 2.30e-5 1.33e-5 3.20e-5 7.72e-5
# # ℹ 5 more variables: TRA <dbl>, FIRE <dbl>, GOV <dbl>, OTH <dbl>, SUM <dbl>
As discussed, whether using the argument to fast statistical functions or TRA directly, these data transformations are essentially a two-step process: statistics are first computed and then used to transform the original data.
Although both steps are efficiently done in C++, it would be even more efficient to do them in a single step without materializing all the statistics before transforming the data. Such slightly more efficient functions are provided for the very commonly applied tasks of centering and averaging data by groups (widely known as ‘between’-group and ‘within’-group transformations), and scaling and centering data by groups (also known as ‘standardizing’ data).
2.4 Faster Centering, Averaging and Standardizing
The functions fbetween and fwithin are slightly more memory efficient implementations of fmean invoked with different TRA options:
GGDC10S %>% # Same as ... %>% fmean(TRA = "replace")
fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fbetween %>% tail(2)
# # A tibble: 2 × 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4444. 34.9 1614. 131. 997. 1307. 799. 320. 2958. NA 12605.
# 2 4444. 34.9 1614. 131. 997. 1307. 799. 320. 2958. NA 12605.
GGDC10S %>% # Same as ... %>% fmean(TRA = "replace_fill")
fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fbetween(fill = TRUE) %>% tail(2)
# # A tibble: 2 × 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4444. 34.9 1614. 131. 997. 1307. 799. 320. 2958. NA 12605.
# 2 4444. 34.9 1614. 131. 997. 1307. 799. 320. 2958. NA 12605.
GGDC10S %>% # Same as ... %>% fmean(TRA = "-")
fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fwithin %>% tail(2)
# # A tibble: 2 × 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 742. -7.35 760. 187. 1798. 1713. 1249. 495. 2678. NA 9614.
# 2 717. -10.1 734. 194. 1934. 1803. 1266. 512. 2778. NA 9928.
Apart from higher speed, fwithin has a mean argument to assign an arbitrary mean to centered data, the default being mean = 0. A very common choice for such an added mean is just the overall mean of the data, which can be added in by invoking mean = "overall.mean":
GGDC10S %>%
fgroup_by(Variable, Country) %>%
fselect(Country, Variable, AGR:SUM) %>% fwithin(mean = "overall.mean") %>% tail(3)
# # A tibble: 3 × 13
# Country Variable AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EGY EMP 2527458. 1867903. 5539313. 3.36e5 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 NA 2.16e7
# 2 EGY EMP 2527439. 1867902. 5539251. 3.36e5 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 NA 2.16e7
# 3 EGY EMP 2527413. 1867899. 5539226. 3.36e5 1.80e6 3.39e6 1.47e6 1.66e6 1.72e6 NA 2.16e7
This can also be done using weights. The code below uses the SUM column as weights: for each variable and each group it subtracts out the weighted mean, and then adds the overall weighted column mean back to the centered columns. The SUM column itself is just kept as is and added after the grouping columns.
GGDC10S %>%
fgroup_by(Variable, Country) %>%
fselect(Country, Variable, AGR:SUM) %>% fwithin(SUM, mean = "overall.mean") %>% tail(3)
# # A tibble: 3 × 13
# Country Variable SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EGY EMP 22020. 429066006. 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8 NA
# 2 EGY EMP 22219. 429065986. 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8 NA
# 3 EGY EMP 22533. 429065961. 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8 NA
Another argument to fwithin is the theta parameter, allowing partial- or quasi-demeaning operations, e.g. fwithin(gdata, theta = theta) is equal to gdata - theta * fbetween(gdata). This is particularly useful to prepare data for variance components (also known as ‘random-effects’) estimation.
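A quick sketch of this identity on a single column, using an arbitrary illustrative value theta = 0.5:

# Quasi-demeaning AGR by Variable and Country, two equivalent ways:
g <- GRP(GGDC10S, ~ Variable + Country)
theta <- 0.5
all.equal(fwithin(GGDC10S$AGR, g, theta = theta),
          GGDC10S$AGR - theta * fbetween(GGDC10S$AGR, g))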
Apart from fbetween and fwithin, the function fscale exists to efficiently scale and center data, avoiding sequential calls such as ... %>% fsd(TRA = "/") %>% fmean(TRA = "-").
# This efficiently scales and centers (i.e. standardizes) the data
GGDC10S %>%
fgroup_by(Variable, Country) %>%
fselect(Country, Variable, AGR:SUM) %>% fscale
# # A tibble: 5,027 × 13
# Country Variable AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 2 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 3 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 4 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 5 BWA VA -0.738 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
# 6 BWA VA -0.739 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
# 7 BWA VA -0.736 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.595 -0.676
# 8 BWA VA -0.734 -0.717 -0.668 -0.805 -0.692 -0.604 -0.589 -0.635 -0.655 -0.595 -0.676
# 9 BWA VA -0.730 -0.717 -0.668 -0.805 -0.692 -0.604 -0.588 -0.635 -0.656 -0.596 -0.676
# 10 BWA VA -0.729 -0.716 -0.667 -0.803 -0.690 -0.603 -0.588 -0.635 -0.656 -0.596 -0.675
# # ℹ 5,017 more rows
#
# Grouped by: Variable, Country [85 | 59 (7.7) 4-65]
fscale also has additional mean and sd arguments allowing the user to (group-) scale data to an arbitrary mean and standard deviation. Setting mean = FALSE just scales the data while preserving the means, and is thus different from fsd(..., TRA = "/"), which simply divides all values by the standard deviation:
# Saving grouped tibble
gGGDC <- GGDC10S %>%
fgroup_by(Variable, Country) %>%
fselect(Country, Variable, AGR:SUM)
# Original means
head(fmean(gGGDC))
# # A tibble: 6 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 102. 742. 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4
# 2 EMP BOL 964. 56.0 235. 5.35 123. 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3
# 3 EMP BRA 17191. 206. 6991. 365. 3525. 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4
# 4 EMP BWA 188. 10.5 18.1 3.09 25.3 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2
# 5 EMP CHL 702. 101. 625. 29.4 296. 6.95e2 2.58e2 272. NA 1.00e3 3.98e3
# 6 EMP CHN 287744. 7050. 67144. 1606. 20852. 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5
# Mean Preserving Scaling
head(fmean(fscale(gGGDC, mean = FALSE)))
# # A tibble: 6 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 102. 742. 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4
# 2 EMP BOL 964. 56.0 235. 5.35 123. 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3
# 3 EMP BRA 17191. 206. 6991. 365. 3525. 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4
# 4 EMP BWA 188. 10.5 18.1 3.09 25.3 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2
# 5 EMP CHL 702. 101. 625. 29.4 296. 6.95e2 2.58e2 272. NA 1.00e3 3.98e3
# 6 EMP CHN 287744. 7050. 67144. 1606. 20852. 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5
head(fsd(fscale(gGGDC, mean = FALSE)))
# # A tibble: 6 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
# 2 EMP BOL 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 NA 1.00 1.00
# 3 EMP BRA 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
# 4 EMP BWA 1.00 1.00 1.00 1 1 1.00 1.00 1.00 1.00 1.00 1.00
# 5 EMP CHL 1.00 1 1.00 1.00 1.00 1.00 1.00 1.00 NA 1.00 1.00
# 6 EMP CHN 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
One can also set mean = "overall.mean", which group-centers columns on the overall mean as illustrated with fwithin. Another interesting option is setting sd = "within.sd". This group-scales data such that every group has a standard deviation equal to the within standard deviation of the data:
# Just using VA data for this example
gGGDC <- GGDC10S %>%
fsubset(Variable == "VA", Country, AGR:SUM) %>%
fgroup_by(Country)
# This calculates the within standard deviation for all columns
fsd(num_vars(ungroup(fwithin(gGGDC))))
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# 45046972 40122220 75608708 3062688 30811572 44125207 20676901 16030868 20358973 18780869
# SUM
# 306429102
# This scales all groups to take on the within standard deviation while preserving group means
fsd(fscale(gGGDC, mean = FALSE, sd = "within.sd"))
# # A tibble: 43 × 12
# Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 ARG 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 2 BOL 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 NA 1.88e7 3.06e8
# 3 BRA 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 4 BWA 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 5 CHL 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 NA 1.88e7 3.06e8
# 6 CHN 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 7 COL 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 NA 1.88e7 3.06e8
# 8 CRI 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 9 DEW 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 10 DNK 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# # ℹ 33 more rows
A grouped scaling operation with both mean = "overall.mean" and sd = "within.sd" thus efficiently harmonizes all groups in the first two moments without changing the fundamental properties (in terms of level and scale) of the data.
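A minimal sketch of this combined operation (output omitted; harmonized is an illustrative name):
# Sketch: harmonize all groups in the first two moments
harmonized <- fscale(gGGDC, mean = "overall.mean", sd = "within.sd")
fmean(harmonized) # group means now all equal the overall column means
fsd(harmonized)   # group standard deviations now all equal the within standard deviation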
2.5 Lags / Leads, Differences and Growth Rates
This section introduces 3 further powerful collapse functions: flag, fdiff and fgrowth. The first of these, flag, efficiently computes sequences of fully identified lags and leads on time series and panel data. The following code computes 1 fully identified panel-lag and 1 fully identified panel-lead of each variable in the data:
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% flag(-1:1, Year)
# # A tibble: 5,027 × 36
# Country Variable Year F1.AGR AGR L1.AGR F1.MIN MIN L1.MIN F1.MAN MAN L1.MAN F1.PU PU
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA NA NA
# 4 BWA VA 1963 16.3 NA NA 3.49 NA NA 0.737 NA NA 0.104 NA
# 5 BWA VA 1964 15.7 16.3 NA 2.50 3.49 NA 1.02 0.737 NA 0.135 0.104
# 6 BWA VA 1965 17.7 15.7 16.3 1.97 2.50 3.49 0.804 1.02 0.737 0.203 0.135
# 7 BWA VA 1966 19.1 17.7 15.7 2.30 1.97 2.50 0.938 0.804 1.02 0.203 0.203
# 8 BWA VA 1967 21.1 19.1 17.7 1.84 2.30 1.97 0.750 0.938 0.804 0.203 0.203
# 9 BWA VA 1968 21.9 21.1 19.1 5.24 1.84 2.30 2.14 0.750 0.938 0.578 0.203
# 10 BWA VA 1969 23.1 21.9 21.1 10.2 5.24 1.84 4.15 2.14 0.750 1.12 0.578
# # ℹ 5,017 more rows
# # ℹ 22 more variables: L1.PU <dbl>, F1.CON <dbl>, CON <dbl>, L1.CON <dbl>, F1.WRT <dbl>, WRT <dbl>,
# # L1.WRT <dbl>, F1.TRA <dbl>, TRA <dbl>, L1.TRA <dbl>, F1.FIRE <dbl>, FIRE <dbl>, L1.FIRE <dbl>,
# # F1.GOV <dbl>, GOV <dbl>, L1.GOV <dbl>, F1.OTH <dbl>, OTH <dbl>, L1.OTH <dbl>, F1.SUM <dbl>,
# # SUM <dbl>, L1.SUM <dbl>
#
# Grouped by: Variable, Country [85 | 59 (7.7) 4-65]
If the time variable passed does not exactly identify the data (i.e. because of repeated values in each group), all 3 functions will issue appropriate error messages. flag, fdiff and fgrowth support irregular time series and unbalanced panels.
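For instance, the following sketch (not run) would trigger such an error, since grouping GGDC10S only by Country leaves repeated Year values within each group (every country carries both a VA and an EMP series):
# Not run: Year does not uniquely identify the data within Country groups
# GGDC10S %>% fgroup_by(Country) %>% flag(1, Year)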
It is also possible to omit the time variable if one is certain that the data is sorted:
GGDC10S %>%
  fselect(Variable, Country, AGR:SUM) %>%
fgroup_by(Variable, Country) %>% flag
# # A tibble: 5,027 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 6 VA BWA 16.3 3.49 0.737 0.104 0.660 6.24 1.66 1.12 4.82 2.34 37.5
# 7 VA BWA 15.7 2.50 1.02 0.135 1.35 7.06 1.94 1.25 5.70 2.68 39.3
# 8 VA BWA 17.7 1.97 0.804 0.203 1.35 8.27 2.15 1.36 6.37 2.99 43.1
# 9 VA BWA 19.1 2.30 0.938 0.203 0.897 4.31 1.72 1.54 7.04 3.31 41.4
# 10 VA BWA 21.1 1.84 0.750 0.203 1.22 5.17 2.44 1.03 5.03 2.36 41.1
# # ℹ 5,017 more rows
#
# Grouped by: Variable, Country [85 | 59 (7.7) 4-65]
fdiff computes sequences of lagged / leaded and iterated differences as well as quasi-differences and log-differences on time series and panel data. The code below computes the 1 and 10 year first and second differences of each variable in the data:
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% fdiff(c(1, 10), 1:2, Year)
# # A tibble: 5,027 × 47
# Country Variable Year D1.AGR D2.AGR L10D1.AGR L10D2.AGR D1.MIN D2.MIN L10D1.MIN L10D2.MIN D1.MAN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA NA NA NA
# 6 BWA VA 1965 -0.575 NA NA NA -0.998 NA NA NA 0.282
# 7 BWA VA 1966 1.95 2.53 NA NA -0.525 0.473 NA NA -0.214
# 8 BWA VA 1967 1.47 -0.488 NA NA 0.328 0.854 NA NA 0.134
# 9 BWA VA 1968 1.95 0.488 NA NA -0.460 -0.788 NA NA -0.188
# 10 BWA VA 1969 0.763 -1.19 NA NA 3.41 3.87 NA NA 1.39
# # ℹ 5,017 more rows
# # ℹ 35 more variables: D2.MAN <dbl>, L10D1.MAN <dbl>, L10D2.MAN <dbl>, D1.PU <dbl>, D2.PU <dbl>,
# # L10D1.PU <dbl>, L10D2.PU <dbl>, D1.CON <dbl>, D2.CON <dbl>, L10D1.CON <dbl>, L10D2.CON <dbl>,
# # D1.WRT <dbl>, D2.WRT <dbl>, L10D1.WRT <dbl>, L10D2.WRT <dbl>, D1.TRA <dbl>, D2.TRA <dbl>,
# # L10D1.TRA <dbl>, L10D2.TRA <dbl>, D1.FIRE <dbl>, D2.FIRE <dbl>, L10D1.FIRE <dbl>,
# # L10D2.FIRE <dbl>, D1.GOV <dbl>, D2.GOV <dbl>, L10D1.GOV <dbl>, L10D2.GOV <dbl>, D1.OTH <dbl>,
# # D2.OTH <dbl>, L10D1.OTH <dbl>, L10D2.OTH <dbl>, D1.SUM <dbl>, D2.SUM <dbl>, L10D1.SUM <dbl>, …
#
# Grouped by: Variable, Country [85 | 59 (7.7) 4-65]
Log-differences of the form \(\log(x_t) - \log(x_{t-s})\) are also easily computed:
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% fdiff(c(1, 10), 1, Year, log = TRUE)
# # A tibble: 5,027 × 25
# Country Variable Year Dlog1.AGR L10Dlog1.AGR Dlog1.MIN L10Dlog1.MIN Dlog1.MAN L10Dlog1.MAN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA
# 6 BWA VA 1965 -0.0359 NA -0.336 NA 0.324 NA
# 7 BWA VA 1966 0.117 NA -0.236 NA -0.236 NA
# 8 BWA VA 1967 0.0796 NA 0.154 NA 0.154 NA
# 9 BWA VA 1968 0.0972 NA -0.223 NA -0.223 NA
# 10 BWA VA 1969 0.0355 NA 1.05 NA 1.05 NA
# # ℹ 5,017 more rows
# # ℹ 16 more variables: Dlog1.PU <dbl>, L10Dlog1.PU <dbl>, Dlog1.CON <dbl>, L10Dlog1.CON <dbl>,
# # Dlog1.WRT <dbl>, L10Dlog1.WRT <dbl>, Dlog1.TRA <dbl>, L10Dlog1.TRA <dbl>, Dlog1.FIRE <dbl>,
# # L10Dlog1.FIRE <dbl>, Dlog1.GOV <dbl>, L10Dlog1.GOV <dbl>, Dlog1.OTH <dbl>, L10Dlog1.OTH <dbl>,
# # Dlog1.SUM <dbl>, L10Dlog1.SUM <dbl>
#
# Grouped by: Variable, Country [85 | 59 (7.7) 4-65]
It is also possible to compute quasi-differences and quasi-log-differences of the form \(x_t - \rho x_{t-s}\) or \(\log(x_t) - \rho \log(x_{t-s})\):
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% fdiff(t = Year, rho = 0.95)
# # A tibble: 5,027 × 14
# Country Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA NA NA NA NA
# 6 BWA VA 1965 0.241 -0.824 0.318 0.0359 0.719 1.13 0.363 0.184 1.11 0.454
# 7 BWA VA 1966 2.74 -0.401 -0.163 0.0743 0.0673 1.56 0.312 0.174 0.955 0.449
# 8 BWA VA 1967 2.35 0.427 0.174 0.0101 -0.381 -3.55 -0.323 0.246 0.988 0.465
# 9 BWA VA 1968 2.91 -0.345 -0.141 0.0101 0.365 1.08 0.804 -0.427 -1.66 -0.780
# 10 BWA VA 1969 1.82 3.50 1.43 0.385 2.32 0.841 0.397 0.252 0.818 0.385
# # ℹ 5,017 more rows
# # ℹ 1 more variable: SUM <dbl>
#
# Grouped by: Variable, Country [85 | 59 (7.7) 4-65]
The quasi-differencing feature was added to fdiff to facilitate the preparation of time series and panel data for least-squares estimations suffering from serial correlation, following Cochrane & Orcutt (1949).
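A minimal sketch of such a correction on a single series (the AR(1) residual estimate and the choice of variables are purely illustrative):
# Sketch: quasi-difference data with an estimated AR(1) residual coefficient
dat <- na_omit(fsubset(GGDC10S, Variable == "VA" & Country == "BWA", Year, AGR, MAN))
res <- resid(lm(AGR ~ MAN, data = dat))
rho <- cor(res[-1], res[-length(res)]) # estimate of the AR(1) coefficient
datQD <- fdiff(fselect(dat, AGR, MAN), t = dat$Year, rho = rho) # quasi-differenced data
lm(AGR ~ MAN, data = datQD) # re-estimate on the quasi-differenced data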
Finally, fgrowth computes growth rates in the same way. By default exact growth rates are computed in percentage terms using \((x_t - x_{t-s}) / x_{t-s} \times 100\) (the default argument is scale = 100). The user can also request growth rates obtained by log-differencing using \(\log(x_t / x_{t-s}) \times 100\).
# Exact growth rates, computed as: (x/lag(x) - 1) * 100
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% fgrowth(c(1, 10), 1, Year)
# # A tibble: 5,027 × 25
# Country Variable Year G1.AGR L10G1.AGR G1.MIN L10G1.MIN G1.MAN L10G1.MAN G1.PU L10G1.PU G1.CON
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA NA NA NA
# 6 BWA VA 1965 -3.52 NA -28.6 NA 38.2 NA 29.4 NA 104.
# 7 BWA VA 1966 12.4 NA -21.1 NA -21.1 NA 50 NA 0
# 8 BWA VA 1967 8.29 NA 16.7 NA 16.7 NA 0 NA -33.3
# 9 BWA VA 1968 10.2 NA -20 NA -20 NA 0 NA 35.7
# 10 BWA VA 1969 3.61 NA 185. NA 185. NA 185. NA 185.
# # ℹ 5,017 more rows
# # ℹ 13 more variables: L10G1.CON <dbl>, G1.WRT <dbl>, L10G1.WRT <dbl>, G1.TRA <dbl>,
# # L10G1.TRA <dbl>, G1.FIRE <dbl>, L10G1.FIRE <dbl>, G1.GOV <dbl>, L10G1.GOV <dbl>, G1.OTH <dbl>,
# # L10G1.OTH <dbl>, G1.SUM <dbl>, L10G1.SUM <dbl>
#
# Grouped by: Variable, Country [85 | 59 (7.7) 4-65]
# Log-difference growth rates, computed as: log(x / lag(x)) * 100
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% fgrowth(c(1, 10), 1, Year, logdiff = TRUE)
# # A tibble: 5,027 × 25
# Country Variable Year Dlog1.AGR L10Dlog1.AGR Dlog1.MIN L10Dlog1.MIN Dlog1.MAN L10Dlog1.MAN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA
# 6 BWA VA 1965 -3.59 NA -33.6 NA 32.4 NA
# 7 BWA VA 1966 11.7 NA -23.6 NA -23.6 NA
# 8 BWA VA 1967 7.96 NA 15.4 NA 15.4 NA
# 9 BWA VA 1968 9.72 NA -22.3 NA -22.3 NA
# 10 BWA VA 1969 3.55 NA 105. NA 105. NA
# # ℹ 5,017 more rows
# # ℹ 16 more variables: Dlog1.PU <dbl>, L10Dlog1.PU <dbl>, Dlog1.CON <dbl>, L10Dlog1.CON <dbl>,
# # Dlog1.WRT <dbl>, L10Dlog1.WRT <dbl>, Dlog1.TRA <dbl>, L10Dlog1.TRA <dbl>, Dlog1.FIRE <dbl>,
# # L10Dlog1.FIRE <dbl>, Dlog1.GOV <dbl>, L10Dlog1.GOV <dbl>, Dlog1.OTH <dbl>, L10Dlog1.OTH <dbl>,
# # Dlog1.SUM <dbl>, L10Dlog1.SUM <dbl>
#
# Grouped by: Variable, Country [85 | 59 (7.7) 4-65]
fdiff and fgrowth can also perform leaded (forward) differences and growth rates (i.e. ... %>% fgrowth(-c(1, 10), 1:2, Year) would compute one- and 10-year leaded growth rates, iterated once and twice). Again it is possible to perform sequential operations:
# This computes the 1 and 10-year growth rates for the current period and lagged by one period
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% fgrowth(c(1, 10), 1, Year) %>% flag(0:1, Year)
# # A tibble: 5,027 × 47
# Country Variable Year G1.AGR L1.G1.AGR L10G1.AGR L1.L10G1.AGR G1.MIN L1.G1.MIN L10G1.MIN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA NA
# 6 BWA VA 1965 -3.52 NA NA NA -28.6 NA NA
# 7 BWA VA 1966 12.4 -3.52 NA NA -21.1 -28.6 NA
# 8 BWA VA 1967 8.29 12.4 NA NA 16.7 -21.1 NA
# 9 BWA VA 1968 10.2 8.29 NA NA -20 16.7 NA
# 10 BWA VA 1969 3.61 10.2 NA NA 185. -20 NA
# # ℹ 5,017 more rows
# # ℹ 37 more variables: L1.L10G1.MIN <dbl>, G1.MAN <dbl>, L1.G1.MAN <dbl>, L10G1.MAN <dbl>,
# # L1.L10G1.MAN <dbl>, G1.PU <dbl>, L1.G1.PU <dbl>, L10G1.PU <dbl>, L1.L10G1.PU <dbl>,
# # G1.CON <dbl>, L1.G1.CON <dbl>, L10G1.CON <dbl>, L1.L10G1.CON <dbl>, G1.WRT <dbl>,
# # L1.G1.WRT <dbl>, L10G1.WRT <dbl>, L1.L10G1.WRT <dbl>, G1.TRA <dbl>, L1.G1.TRA <dbl>,
# # L10G1.TRA <dbl>, L1.L10G1.TRA <dbl>, G1.FIRE <dbl>, L1.G1.FIRE <dbl>, L10G1.FIRE <dbl>,
# # L1.L10G1.FIRE <dbl>, G1.GOV <dbl>, L1.G1.GOV <dbl>, L10G1.GOV <dbl>, L1.L10G1.GOV <dbl>, …
#
# Grouped by: Variable, Country [85 | 59 (7.7) 4-65]
3. Benchmarks
This section seeks to demonstrate that the functionality introduced in the preceding 2 sections indeed produces code that evaluates substantially faster than native dplyr.
To do this properly, the different components of a typical piped call (selecting / subsetting, ordering, grouping, and performing some computation) are benchmarked separately on 2 different data sizes.
All benchmarks are run on a Windows 8.1 laptop with a 2x 2.2 GHz Intel i5 processor, 8 GB DDR3 RAM and a Samsung 850 EVO SSD.
3.1 Data
Benchmarks are run on the original GGDC10S data used throughout this vignette and on a larger dataset with approx. 1 million observations, obtained by replicating and row-binding GGDC10S 200 times while maintaining unique groups.
# This shows the groups in GGDC10S
GRP(GGDC10S, ~ Variable + Country)
# collapse grouping object of length 5027 with 85 ordered groups
#
# Call: GRP.default(X = GGDC10S, by = ~Variable + Country), X is unsorted
#
# Distribution of group sizes:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.00 53.00 62.00 59.14 63.00 65.00
#
# Groups with sizes:
# EMP.ARG EMP.BOL EMP.BRA EMP.BWA EMP.CHL EMP.CHN
# 62 61 62 52 63 62
# ---
# VA.TWN VA.TZA VA.USA VA.VEN VA.ZAF VA.ZMB
# 63 52 65 63 52 52
# This replicates the data 200 times
data <- replicate(200, GGDC10S, simplify = FALSE)
# This function appends a number i to the Country and Variable columns of each dataset
uniquify <- function(x, i) ftransform(x, lapply(unclass(x)[c(1,4)], paste0, i)) # columns 1 and 4 are Country and Variable
# Making datasets unique and row-binding them
data <- unlist2d(Map(uniquify, data, as.list(1:200)), idcols = FALSE)
fdim(data)
# [1] 1005400 16
# This shows the groups in the replicated data
GRP(data, ~ Variable + Country)
# collapse grouping object of length 1005400 with 17000 ordered groups
#
# Call: GRP.default(X = data, by = ~Variable + Country), X is unsorted
#
# Distribution of group sizes:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.00 53.00 62.00 59.14 63.00 65.00
#
# Groups with sizes:
# EMP1.ARG1 EMP1.BOL1 EMP1.BRA1 EMP1.BWA1 EMP1.CHL1 EMP1.CHN1
# 62 61 62 52 63 62
# ---
# VA99.TWN99 VA99.TZA99 VA99.USA99 VA99.VEN99 VA99.ZAF99 VA99.ZMB99
# 63 52 65 63 52 52
gc()
# used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
# Ncells 1230117 65.7 2298892 122.8 NA 2298892 122.8
# Vcells 18397824 140.4 27645074 211.0 16384 21571941 164.6
3.2 Selecting, Subsetting, Ordering and Grouping
## Selecting columns
# Small
microbenchmark(dplyr = select(GGDC10S, Country, Variable, AGR:SUM),
collapse = fselect(GGDC10S, Country, Variable, AGR:SUM))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 1293.804 1411.6825 1539.27181 1472.664 1569.2895 2556.615 100
# collapse 7.925 10.4225 18.60163 18.506 25.4915 80.710 100
# Large
microbenchmark(dplyr = select(data, Country, Variable, AGR:SUM),
collapse = fselect(data, Country, Variable, AGR:SUM))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 1306.134 1369.600 1431.93924 1405.0595 1461.056 1933.624 100
# collapse 7.902 9.428 16.29623 11.2565 23.393 46.141 100
## Subsetting rows
# Small
microbenchmark(dplyr = filter(GGDC10S, Variable == "VA"),
collapse = fsubset(GGDC10S, Variable == "VA"))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 1263.007 1471.1175 1565.068 1533.331 1593.104 2816.881 100
# collapse 86.386 225.0545 237.104 243.403 269.532 362.129 100
# Large
microbenchmark(dplyr = filter(data, Variable == "VA"),
collapse = fsubset(data, Variable == "VA"))
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 10.587768 16.289159 19.59112 16.583711 18.130880 162.5947 100
# collapse 6.462212 8.227627 10.96393 8.375484 9.481286 66.6160 100
## Ordering rows
# Small
microbenchmark(dplyr = arrange(GGDC10S, desc(Country), Variable, Year),
collapse = roworder(GGDC10S, -Country, Variable, Year))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 5865.495 6074.758 6728.6247 6243.248 6523.7460 19857.891 100
# collapse 676.340 702.541 788.4782 747.163 780.3655 2833.942 100
# Large
microbenchmark(dplyr = arrange(data, desc(Country), Variable, Year),
collapse = roworder(data, -Country, Variable, Year), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 169.1980 169.1980 179.9956 179.9956 190.7933 190.7933 2
# collapse 149.6938 149.6938 151.9128 151.9128 154.1319 154.1319 2
## Grouping
# Small
microbenchmark(dplyr = group_by(GGDC10S, Country, Variable),
collapse = fgroup_by(GGDC10S, Country, Variable))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 2333.565 2559.6840 2772.1049 2667.2590 2761.1610 14435.300 100
# collapse 295.266 308.5265 352.0251 361.3445 376.9055 831.725 100
# Large
microbenchmark(dplyr = group_by(data, Country, Variable),
collapse = fgroup_by(data, Country, Variable), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 63.51311 64.55712 70.69052 65.12443 70.32610 110.26681 10
# collapse 54.29580 54.72237 55.47530 55.01460 55.59795 58.83308 10
## Computing a new column
# Small
microbenchmark(dplyr = mutate(GGDC10S, NEW = AGR+1),
collapse = ftransform(GGDC10S, NEW = AGR+1))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 1061.069 1109.733 1165.19449 1141.4655 1222.6405 1502.039 100
# collapse 40.293 45.105 62.11237 54.9565 76.9085 203.228 100
# Large
microbenchmark(dplyr = mutate(data, NEW = AGR+1),
collapse = ftransform(data, NEW = AGR+1))
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 2.201600 2.720108 5.062141 5.024468 5.182031 32.94185 100
# collapse 1.034854 1.473098 3.096822 3.708680 3.780488 23.70657 100
## All combined with pipes
# Small
microbenchmark(dplyr = filter(GGDC10S, Variable == "VA") %>%
select(Country, Year, AGR:SUM) %>%
arrange(desc(Country), Year) %>%
mutate(NEW = AGR+1) %>%
group_by(Country),
collapse = fsubset(GGDC10S, Variable == "VA", Country, Year, AGR:SUM) %>%
roworder(-Country, Year) %>%
ftransform(NEW = AGR+1) %>%
fgroup_by(Country))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 9273.208 9618.865 10285.0387 9797.558 10058.4985 17886.526 100
# collapse 316.815 389.159 466.4565 449.526 511.7675 753.507 100
# Large
microbenchmark(dplyr = filter(data, Variable == "VA") %>%
select(Country, Year, AGR:SUM) %>%
arrange(desc(Country), Year) %>%
mutate(NEW = AGR+1) %>%
group_by(Country),
collapse = fsubset(data, Variable == "VA", Country, Year, AGR:SUM) %>%
roworder(-Country, Year) %>%
ftransform(NEW = AGR+1) %>%
fgroup_by(Country), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 23.36268 23.476515 23.890469 23.650624 23.855489 25.675314 10
# collapse 7.95352 8.063279 8.113638 8.081951 8.193313 8.302148 10
gc()
# used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
# Ncells 1235994 66.1 2298892 122.8 NA 2298892 122.8
# Vcells 18415743 140.6 56628884 432.1 16384 65340836 498.6
3.3 Aggregation
## Grouping the data
cgGGDC10S <- fgroup_by(GGDC10S, Variable, Country) %>% fselect(-Region, -Regioncode)
gGGDC10S <- group_by(GGDC10S, Variable, Country) %>% fselect(-Region, -Regioncode)
cgdata <- fgroup_by(data, Variable, Country) %>% fselect(-Region, -Regioncode)
gdata <- group_by(data, Variable, Country) %>% fselect(-Region, -Regioncode)
rm(data, GGDC10S)
gc()
# used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
# Ncells 1252996 67.0 2298892 122.8 NA 2298892 122.8
# Vcells 18034544 137.6 56628884 432.1 16384 65340836 498.6
## Conversion of grouping object: this extra time would be incurred in all hybrid calls,
## i.e. when calling collapse functions on data grouped with dplyr::group_by
# Small
microbenchmark(GRP(gGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# GRP(gGGDC10S) 20.456 20.8715 22.53989 21.157 21.5585 119.899 100
# Large
microbenchmark(GRP(gdata))
# Unit: milliseconds
# expr min lq mean median uq max neval
# GRP(gdata) 1.333725 2.841457 3.080191 2.92659 2.992167 27.80515 100
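Such a hybrid call simply means applying a collapse function directly to a tibble grouped with dplyr::group_by, e.g. (output omitted):
# Hybrid call: the dplyr grouping is converted internally via GRP()
fmean(gGGDC10S)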
## Sum
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, sum, na.rm = TRUE),
collapse = fsum(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 9796.104 10120.6355 11441.3392 10342.7005 12210.801 20109.044 100
# collapse 199.142 214.2255 243.0461 251.0875 262.745 337.995 100
# Large
microbenchmark(dplyr = summarise_all(gdata, sum, na.rm = TRUE),
collapse = fsum(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 992.84663 1081.88163 1131.56837 1108.71517 1218.63609 1283.90879 10
# collapse 35.73932 36.39951 36.95217 37.13183 37.57566 37.83655 10
## Mean
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, mean.default, na.rm = TRUE),
collapse = fmean(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 12291.842 12944.2940 15050.4419 13467.689 14240.1980 31920.170 100
# collapse 164.582 182.1995 212.1745 218.547 231.8325 312.473 100
# Large
microbenchmark(dplyr = summarise_all(gdata, mean.default, na.rm = TRUE),
collapse = fmean(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 1868.83929 2086.54907 2191.40465 2162.74508 2375.43550 2430.97836 10
# collapse 28.90503 29.67181 30.92093 30.68978 31.78182 33.99441 10
## Median
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, median, na.rm = TRUE),
collapse = fmedian(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 41016.440 42140.904 47221.3157 44546.462 50961.724 77460.185 100
# collapse 242.552 258.997 288.2566 278.267 316.762 358.016 100
# Large
microbenchmark(dplyr = summarise_all(gdata, median, na.rm = TRUE),
collapse = fmedian(cgdata), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 8351.66049 8351.66049 8442.69729 8442.69729 8533.73409 8533.73409 2
# collapse 32.48548 32.48548 39.40571 39.40571 46.32593 46.32593 2
## Standard Deviation
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, sd, na.rm = TRUE),
collapse = fsd(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 21956.311 22573.0990 25147.5496 23193.770 26775.452 46433.530 100
# collapse 297.288 313.8295 343.5066 360.354 365.979 445.043 100
# Large
microbenchmark(dplyr = summarise_all(gdata, sd, na.rm = TRUE),
collapse = fsd(cgdata), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 4115.55739 4115.55739 4147.93306 4147.93306 4180.30873 4180.30873 2
# collapse 58.73078 58.73078 59.11848 59.11848 59.50617 59.50617 2
## Maximum
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, max, na.rm = TRUE),
collapse = fmax(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 142571.538 148832.0875 154487.1335 152224.168 155732.8770 240181.896 100
# collapse 94.538 106.2525 138.8007 126.372 157.8345 591.737 100
# Large
microbenchmark(dplyr = summarise_all(gdata, max, na.rm = TRUE),
collapse = fmax(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 1321.59878 1508.55966 1588.01880 1612.46456 1718.94561 1782.51821 10
# collapse 14.70121 14.96189 15.98858 15.42402 16.80483 19.70822 10
## First Value
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, first),
collapse = ffirst(cgGGDC10S, na.rm = FALSE))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 28600.975 29489.7935 32266.08136 30216.6780 33623.2150 61176.052 100
# collapse 34.102 41.1535 65.65912 81.6785 84.7585 137.993 100
# Large
microbenchmark(dplyr = summarise_all(gdata, first),
collapse = ffirst(cgdata, na.rm = FALSE), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 5392.670250 5480.666245 5959.414679 5828.230288 6016.249066 7850.307015 10
# collapse 2.821193 3.073678 3.211965 3.121979 3.410022 3.687975 10
## Number of Distinct Values
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, n_distinct, na.rm = TRUE),
collapse = fndistinct(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 33093.954 35295.963 39579.7509 36628.623 39966.0655 72087.292 100
# collapse 389.781 421.029 474.8587 461.051 488.3375 1031.323 100
# Large
microbenchmark(dplyr = summarise_all(gdata, n_distinct, na.rm = TRUE),
collapse = fndistinct(cgdata), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 6073.33658 6227.7011 6366.2213 6482.9552 6516.52959 6530.58389 5
# collapse 73.51751 73.7262 76.1062 75.2853 78.54948 79.45248 5
gc()
# used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
# Ncells 2051529 109.6 4603287 245.9 NA 4603287 245.9
# Vcells 19353891 147.7 56628884 432.1 16384 65340836 498.6
Below are some additional benchmarks for weighted aggregations and aggregations using the statistical mode, which cannot easily or efficiently be performed with dplyr.
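For reference, a rough dplyr analogue of the grouped weighted mean is sketched below (weighted.mean's missing-value handling does not exactly match fmean's, and the weight column SUM is excluded from the aggregation):
# Sketch: grouped weighted mean in dplyr, using the SUM column as weights
summarise(gGGDC10S, across(AGR:OTH, \(x) weighted.mean(x, w = SUM, na.rm = TRUE)))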
## Weighted Mean
# Small
microbenchmark(fmean(cgGGDC10S, SUM))
# Unit: microseconds
# expr min lq mean median uq max neval
# fmean(cgGGDC10S, SUM) 186.103 188.1 197.4378 194.69 197.0455 393.619 100
# Large
microbenchmark(fmean(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmean(cgdata, SUM) 31.50519 32.64797 32.92088 32.877 33.5146 33.79636 10
## Weighted Standard-Deviation
# Small
microbenchmark(fsd(cgGGDC10S, SUM))
# Unit: microseconds
# expr min lq mean median uq max neval
# fsd(cgGGDC10S, SUM) 314.984 319.4455 354.6143 330.945 376.3585 517.547 100
# Large
microbenchmark(fsd(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fsd(cgdata, SUM) 57.34446 58.48555 59.29281 59.08383 60.63042 61.06044 10
## Statistical Mode
# Small
microbenchmark(fmode(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# fmode(cgGGDC10S) 549.737 552.3725 569.1913 555.4415 563.322 957.467 100
# Large
microbenchmark(fmode(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmode(cgdata) 99.75954 100.0756 103.6065 102.2925 105.4069 114.7621 10
## Weighted Statistical Mode
# Small
microbenchmark(fmode(cgGGDC10S, SUM))
# Unit: microseconds
# expr min lq mean median uq max neval
# fmode(cgGGDC10S, SUM) 636.031 639.753 652.8871 653.371 657.317 800.119 100
# Large
microbenchmark(fmode(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmode(cgdata, SUM) 119.0965 119.5147 121.5461 119.9586 123.6282 129.4281 10
gc()
# used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
# Ncells 2050986 109.6 4603287 245.9 NA 4603287 245.9
# Vcells 19350513 147.7 58004644 442.6 16384 68274814 520.9
3.4 Transformation
## Replacing with group sum
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, sum, na.rm = TRUE),
collapse = fsum(cgGGDC10S, TRA = "replace_fill"))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 21392.849 22023.0585 23929.2401 22403.1340 23240.8270 45226.426 100
# collapse 433.988 464.3205 508.2899 505.5835 523.5725 828.668 100
# Large
microbenchmark(dplyr = mutate_all(gdata, sum, na.rm = TRUE),
collapse = fsum(cgdata, TRA = "replace_fill"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 1254.66191 1573.46591 1675.1984 1633.50438 1798.4097 2259.986 10
# collapse 79.45597 82.85858 141.0769 93.65949 125.7272 378.952 10
## Dividing by group sum
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) x/sum(x, na.rm = TRUE)),
collapse = fsum(cgGGDC10S, TRA = "/"))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 21889.932 22841.0065 24911.7123 23161.622 24045.6800 49516.978 100
# collapse 411.029 504.5385 546.2004 532.849 577.9385 845.263 100
# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) x/sum(x, na.rm = TRUE)),
collapse = fsum(cgdata, TRA = "/"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 1510.01025 1800.82433 1978.7045 2006.44629 2058.5870 2749.9129 10
# collapse 68.96276 93.81421 130.6796 97.06853 120.2051 406.4697 10
## Centering
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) x-mean.default(x, na.rm = TRUE)),
collapse = fwithin(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 25110.212 26966.5815 30224.1595 27854.0375 29058.4985 60047.068 100
# collapse 286.664 457.6455 504.7102 497.8095 524.1495 1014.046 100
# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) x-mean.default(x, na.rm = TRUE)),
collapse = fwithin(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 2392.86214 2909.5813 2930.7914 2978.59425 3025.9808 3204.5505 10
# collapse 81.03107 82.4483 101.4395 87.77482 111.4296 180.9257 10
## Centering and Scaling (Standardizing)
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) (x-mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)),
collapse = fscale(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 38688.442 40532.1450 45232.3076 42211.4870 45527.9770 79025.874 100
# collapse 401.411 555.1765 600.8227 598.7885 640.9385 939.811 100
# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) (x-mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)),
collapse = fscale(cgdata), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 6154.3956 6154.3956 6305.7617 6305.7617 6457.1278 6457.1278 2
# collapse 104.4969 104.4969 104.9055 104.9055 105.3141 105.3141 2
## Lag
# Small
microbenchmark(dplyr_unordered = mutate(gGGDC10S, across(everything(), dplyr::lag)),
collapse_unordered = flag(cgGGDC10S),
dplyr_ordered = mutate(gGGDC10S, across(everything(), \(x) dplyr::lag(x, order_by = Year))),
collapse_ordered = flag(cgGGDC10S, t = Year))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr_unordered 38136.747 39462.6530 42614.3947 40011.208 42123.3465 65874.444 100
# collapse_unordered 157.782 307.8845 362.6440 337.906 378.5115 1965.686 100
# dplyr_ordered 66619.385 69225.0635 77369.4471 71373.073 81976.3080 272424.423 100
# collapse_ordered 213.680 356.4105 394.7982 379.769 427.9250 658.394 100
# Large
microbenchmark(dplyr_unordered = mutate(gdata, across(everything(), dplyr::lag)),
collapse_unordered = flag(cgdata),
dplyr_ordered = mutate(gdata, across(everything(), \(x) dplyr::lag(x, order_by = Year))),
collapse_ordered = flag(cgdata, t = Year), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr_unordered 8443.92414 8443.92414 8550.28841 8550.28841 8656.65269 8656.65269 2
# collapse_unordered 38.79103 38.79103 44.77554 44.77554 50.76004 50.76004 2
# dplyr_ordered 14617.21496 14617.21496 15186.58806 15186.58806 15755.96115 15755.96115 2
# collapse_ordered 77.51494 77.51494 78.33778 78.33778 79.16062 79.16062 2
## First-Difference (unordered)
# Small
microbenchmark(dplyr_unordered = mutate_all(gGGDC10S, function(x) x - dplyr::lag(x)),
collapse_unordered = fdiff(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr_unordered 49511.165 54872.8255 82206.0940 59445.576 67956.618 473289.62 100
# collapse_unordered 194.216 388.4445 616.5843 425.185 461.996 13415.31 100
# Large
microbenchmark(dplyr_unordered = mutate_all(gdata, function(x) x - dplyr::lag(x)),
collapse_unordered = fdiff(cgdata), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr_unordered 9598.78999 9598.78999 10077.1020 10077.1020 10555.4141 10555.4141 2
# collapse_unordered 97.15073 97.15073 101.0911 101.0911 105.0314 105.0314 2
gc()
# used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
# Ncells 2058368 110.0 4603289 245.9 NA 4603289 245.9
# Vcells 19367980 147.8 69685572 531.7 16384 69685572 531.7
Below are again some benchmarks for transformations that are not easily or efficiently performed with dplyr, such as centering on the overall mean, mean-preserving scaling, weighted scaling and centering, sequences of lags / leads, and (iterated) panel-differences and growth rates.
# Centering on overall mean
microbenchmark(fwithin(cgdata, mean = "overall.mean"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fwithin(cgdata, mean = "overall.mean") 65.18433 81.22043 117.2022 93.47876 113.5 309.1135 10
# Weighted Centering
microbenchmark(fwithin(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fwithin(cgdata, SUM) 60.08845 87.22967 115.5108 109.1515 121.9648 210.5066 10
microbenchmark(fwithin(cgdata, SUM, mean = "overall.mean"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max
# fwithin(cgdata, SUM, mean = "overall.mean") 81.67774 83.50361 103.9261 110.5742 113.8738 120.0302
# neval
# 10
# Weighted Scaling and Standardizing
microbenchmark(fsd(cgdata, SUM, TRA = "/"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fsd(cgdata, SUM, TRA = "/") 102.8622 110.5255 128.158 125.2058 140.2669 159.3003 10
microbenchmark(fscale(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fscale(cgdata, SUM) 77.04734 83.71536 105.0057 104.4375 121.8894 129.3736 10
# Sequence of lags and leads
microbenchmark(flag(cgdata, -1:1), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# flag(cgdata, -1:1) 75.26523 91.32304 127.853 108.7513 154.4453 281.3359 10
# Iterated difference
microbenchmark(fdiff(cgdata, 1, 2), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fdiff(cgdata, 1, 2) 54.48457 68.9282 83.70135 83.24708 96.69081 108.4353 10
# Growth Rate
microbenchmark(fgrowth(cgdata, 1), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fgrowth(cgdata, 1) 32.64652 44.94556 60.49714 61.62277 72.32179 85.54476 10
References
Timmer, M. P., de Vries, G. J., & de Vries, K. (2015). “Patterns of Structural Change in Developing Countries.” In J. Weiss & M. Tribe (Eds.), Routledge Handbook of Industry and Development (pp. 65–83). Routledge.
Cochrane, D. & Orcutt, G. H. (1949). “Application of Least Squares Regression to Relationships Containing Auto-Correlated Error Terms”. Journal of the American Statistical Association. 44 (245): 32–61.
Prais, S. J. & Winsten, C. B. (1954). “Trend Estimators and Serial Correlation”. Cowles Commission Discussion Paper No. 383. Chicago.