Fast and Easy Data Reshaping
pivot.Rd
pivot()
is collapse's data reshaping command. It combines longer-, wider-, and recast-pivoting functionality in a single parsimonious API. Notably, it can also accommodate variable labels.
Usage
pivot(data, # Summary of Documentation:
ids = NULL, # identifier cols to preserve
values = NULL, # cols containing the data
names = NULL, # name(s) of new col(s) | col(s) containing names
labels = NULL, # name of new labels col | col(s) containing labels
how = "longer", # method: "longer"/"l", "wider"/"w" or "recast"/"r"
na.rm = FALSE, # remove rows missing 'values' in reshaped data
factor = c("names", "labels"), # create new id col(s) as factor variable(s)?
check.dups = FALSE, # detect duplicate 'ids'+'names' combinations
# Only apply if how = "wider" or "recast"
FUN = "last", # aggregation function (internal or external)
FUN.args = NULL, # list of arguments passed to aggregation function
nthreads = .op[["nthreads"]], # minor gains as grouping remains serial
fill = NULL, # value to insert for unbalanced data (default NA/NULL)
drop = TRUE, # drop unused levels (=columns) if 'names' is factor
sort = FALSE, # "ids": sort 'ids' and/or "names": alphabetic casting
# Only applies if how = "wider" with multiple long columns ('values')
transpose = FALSE # "columns": applies t_list() before flattening, and/or
) # "names": sets names nami_colj. default: colj_nami
Arguments
- data
data frame-like object (list of equal-length columns).
- ids
identifier columns to keep. Specified using column names, indices, a logical vector or an identifier function e.g.
is_categorical
.- values
columns containing the data to be reshaped. Specified like
ids
.- names
names of columns to generate, or retrieve variable names from:
how
Description "longer"
list of names for the variable and value column in the long format, respectively. If NULL
,list("variable", "value")
will be chosen. Alternatively, a named list length 1 or 2 can be provided using "variable"/"value" as keys e.g.list(value = "data_col")
."wider"
column(s) containing names of the new variables. Specified using a vector of column names, indices, a logical vector or selector function e.g. is.character
. Multiple columns will be combined usingfinteraction
with"_"
as separator."recast"
(named) list with the following elements: [[1]]/[["from"]] - column(s) containing names of the new variables, specified as in "wider"
; [[2]]/[["to"]] - name of the variable to generate containing old column names. IfNULL
,list("variable", "variable")
will be chosen.- labels
names of columns to generate, or retrieve variable labels from:
how
Description "longer"
A string specifying the name of the column to store labels - retrieved from the data using vlabels(values)
.TRUE
will create a column named"label"
. Alternatively, a (named) list with two elements: [[1]]/[["name"]] - the name of the labels column; [[2]]/[["new"]] - a (named) character vector of new labels for the 'variable', 'label' and 'value' columns in the long-format frame. See Examples."wider"
column(s) containing labels of the new variables. Specified using a vector of column names, indices, a logical vector or selector function e.g. is.character
. Multiple columns will be combined usingfinteraction
with" - "
as separator."recast"
(named) list with the following elements: [[1]]/[["from"]] - column(s) containing labels for the new variables, specified as in "wider"
; [[2]]/[["to"]] - name of the variable to generate containing old labels; [[3]]/[["new"]] - a (named) character vector of new labels for the generated 'variable' and 'label' columns. If [[1]]/[["from"]] is not supplied, this can also include labels for new variables. Omitting one of the elements via a named list or setting it toNULL
in a list of 3 will omit the corresponding operation i.e. either not saving existing labels or not assigning new ones.- how
character. The pivoting method: one of
"longer"
,"wider"
or"recast"
. These can be abbreviated by the first letter i.e."l"/"w"/"r"
.- na.rm
logical.
TRUE
will remove missing values such that in the reshaped data there is no row missing all data columns - selected through 'values'. For wide/recast pivots using internalFUN
's"first"/"last"/"count"
, this also toggles skipping of missing values.- factor
character. Whether to generate new 'names' and/or 'labels' columns as factor variables. This is generally recommended as factors are more memory efficient than character vectors and also faster in subsequent filtering and grouping. Internally, this argument is evaluated as
factor <- c("names", "labels") %in% factor
, so passing anything other than"names"
and/or"labels"
will disable it.- check.dups
logical.
TRUE
checks for duplicate 'ids'+'names' combinations, and, if 'labels' are specified, also for duplicate 'names'+'labels' combinations. The defaultFALSE
implies that the algorithm just runs through the data, leading effectively to theFUN
option to be executed (default last value). See Details.- FUN
function to aggregate values. At present, only a single function is allowed. Fast Statistical Functions receive vectorized execution. For maximum efficiency, a small set of internal functions is provided:
"first"
,"last"
,"count"
,"sum"
,"mean"
,"min"
, or"max"
. In options"first"/"last"/"count"
settingna.rm = TRUE
skips missing values. In options"sum"/"mean"/"min"/"max"
missing values are always skipped (see Details why). Thefill
argument is ignored in"count"/"sum"/"mean"/"min"/"max"
("count"/"sum"
forcefill = 0
elseNA
is used).- FUN.args
(optional) list of arguments passed to
FUN
(if using an external function). Data-length arguments such as weight vectors are supported.- nthreads
integer. if
how = "wider"|"recast"
: number of threads to use with OpenMP (defaultget_collapse("nthreads")
, initialized to 1). Only the distribution of values to columns withhow = "wider"|"recast"
is multithreaded here. Since grouping id columns on a long data frame is expensive and serial, the gains are minor. Withhow = "long"
, multithreading does not make much sense as the most expensive operation is allocating the long results vectors. The rest is a couple ofmemset()
's in C to copy the values.- fill
if
how = "wider"|"recast"
: value to insert for 'ids'-'names' combinations not present in the long format.NULL
usesNA
for atomic vectors andNULL
for lists.- drop
logical. if
how = "wider"|"recast"
and 'names' is a single factor variable:TRUE
will check for and drop unused levels in that factor, avoiding the generation of empty columns.- sort
if
how = "wider"|"recast"
: specifying"ids"
applies ordered grouping on the id-columns, returning data sorted by ids. Specifying"names"
sorts the names before casting (unless 'names' is a factor), yielding columns cast in alphabetic order. Both options can be passed as a character vector, or, alternatively,TRUE
can be used to enable both.- transpose
if
how = "wider"|"recast"
and multiple columns are selected through 'values': specifying"columns"
appliest_list
to the result before flattening, resulting in a different column order. Specifying"names"
generates names of the form nami_colj, instead of colj_nami. Both options can be passed as a character vector, or, alternatively,TRUE
can be used to enable both.
Details
Pivot wider essentially works as follows: compute g_rows = group(ids)
and also g_cols = group(names)
(using group
if sort = FALSE
). g_rows
gives the row-numbers of the wider data frame and g_cols
the column numbers.
Then, a C function generates a wide data frame and runs through each long column ('values'), assigning each value to the corresponding row and column in the wide frame. In this process FUN
is always applied. The default, "last"
, does nothing at all, i.e., if there are duplicates, some values are overwritten. "first"
works similarly just that the C-loop is executed the other way around. The other hard-coded options count, sum, average, or compare observations on the fly. Missing values are internally skipped for statistical functions as there is no way to distinguish an incoming NA
from an initial NA
- apart from counting occurrences using an internal structure of the same size as the result data frame which is costly and thus not implemented.
When passing an R-function to FUN
, the data is grouped using g_full = group(list(g_rows, g_cols))
, aggregated by groups, and expanded again to full length using TRA
before entering the reshaping algorithm. Thus, this is significantly more expensive than the optimized internal functions. With Fast Statistical Functions the aggregation is vectorized across groups, other functions are applied using BY
- by far the slowest option.
If check.dups = TRUE
, a check of the form fnunique(list(g_rows, g_cols)) < fnrow(data)
is run, and an informative warning is issued if duplicates are found.
Recast pivoting works similarly. In long pivots FUN
is ignored and the check simply amounts to fnunique(ids) < fnrow(data)
.
Value
A reshaped data frame with the same class and attributes (except for 'names'/'row-names') as the input frame.
Note
Leaving either 'ids' or 'values' empty will assign all other columns (except for "variable"
if how = "wider"|"recast"
) to the non-specified argument. It is also possible to leave both empty, e.g. for complete melting if how = "wider"
or data transposition if how = "recast"
(similar to data.table::transpose
but supporting multiple names columns and variable labels). See Examples.
pivot
currently does not support concurrently melting/pivoting longer to multiple columns. See data.table::melt
or pivot_longer
from tidyr or tidytable for an efficient alternative with this feature. It is also possible to achieve this with just a little bit of programming. An example is provided below.
Examples
# -------------------------------- PIVOT LONGER ---------------------------------
# Simple Melting (Reshaping Long)
pivot(mtcars) |> head()
#> variable value
#> 1 mpg 21.0
#> 2 mpg 21.0
#> 3 mpg 22.8
#> 4 mpg 21.4
#> 5 mpg 18.7
#> 6 mpg 18.1
pivot(iris, "Species") |> head()
#> Species variable value
#> 1 setosa Sepal.Length 5.1
#> 2 setosa Sepal.Length 4.9
#> 3 setosa Sepal.Length 4.7
#> 4 setosa Sepal.Length 4.6
#> 5 setosa Sepal.Length 5.0
#> 6 setosa Sepal.Length 5.4
pivot(iris, values = 1:4) |> head() # Same thing
#> Species variable value
#> 1 setosa Sepal.Length 5.1
#> 2 setosa Sepal.Length 4.9
#> 3 setosa Sepal.Length 4.7
#> 4 setosa Sepal.Length 4.6
#> 5 setosa Sepal.Length 5.0
#> 6 setosa Sepal.Length 5.4
# Using collapse's datasets
head(wlddev)
#> country iso3c date year decade region income OECD PCGDP
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA
#> 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA
#> 5 Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA
#> LIFEEX GINI ODA POP
#> 1 32.446 NA 116769997 8996973
#> 2 32.962 NA 232080002 9169410
#> 3 33.471 NA 112839996 9351441
#> 4 33.971 NA 237720001 9543205
#> 5 34.463 NA 295920013 9744781
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
pivot(wlddev, 1:8, na.rm = TRUE) |> head()
#> country iso3c date year decade region income OECD variable
#> 1 Afghanistan AFG 2003-01-01 2002 2000 South Asia Low income FALSE PCGDP
#> 2 Afghanistan AFG 2004-01-01 2003 2000 South Asia Low income FALSE PCGDP
#> 3 Afghanistan AFG 2005-01-01 2004 2000 South Asia Low income FALSE PCGDP
#> 4 Afghanistan AFG 2006-01-01 2005 2000 South Asia Low income FALSE PCGDP
#> 5 Afghanistan AFG 2007-01-01 2006 2000 South Asia Low income FALSE PCGDP
#> 6 Afghanistan AFG 2008-01-01 2007 2000 South Asia Low income FALSE PCGDP
#> value
#> 1 330.3036
#> 2 343.0809
#> 3 333.2167
#> 4 357.2347
#> 5 365.2845
#> 6 405.5490
pivot(wlddev, c("iso3c", "year"), c("PCGDP", "LIFEEX"), na.rm = TRUE) |> head()
#> iso3c year variable value
#> 1 AFG 2002 PCGDP 330.3036
#> 2 AFG 2003 PCGDP 343.0809
#> 3 AFG 2004 PCGDP 333.2167
#> 4 AFG 2005 PCGDP 357.2347
#> 5 AFG 2006 PCGDP 365.2845
#> 6 AFG 2007 PCGDP 405.5490
head(GGDC10S)
#> Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT
#> 1 BWA SSA Sub-saharan Africa VA 1960 NA NA NA NA NA NA
#> 2 BWA SSA Sub-saharan Africa VA 1961 NA NA NA NA NA NA
#> 3 BWA SSA Sub-saharan Africa VA 1962 NA NA NA NA NA NA
#> 4 BWA SSA Sub-saharan Africa VA 1963 NA NA NA NA NA NA
#> TRA FIRE GOV OTH SUM
#> 1 NA NA NA NA NA
#> 2 NA NA NA NA NA
#> 3 NA NA NA NA NA
#> 4 NA NA NA NA NA
#> [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
pivot(GGDC10S, 1:5, names = list("Sectorcode", "Value"), na.rm = TRUE) |> head()
#> Country Regioncode Region Variable Year Sectorcode Value
#> 1 BWA SSA Sub-saharan Africa VA 1964 AGR 16.30154
#> 2 BWA SSA Sub-saharan Africa VA 1965 AGR 15.72700
#> 3 BWA SSA Sub-saharan Africa VA 1966 AGR 17.68066
#> 4 BWA SSA Sub-saharan Africa VA 1967 AGR 19.14591
#> 5 BWA SSA Sub-saharan Africa VA 1968 AGR 21.09957
#> 6 BWA SSA Sub-saharan Africa VA 1969 AGR 21.86221
# Can also set by name: variable and/or value. Note that 'value' here remains lowercase
pivot(GGDC10S, 1:5, names = list(variable = "Sectorcode"), na.rm = TRUE) |> head()
#> Country Regioncode Region Variable Year Sectorcode value
#> 1 BWA SSA Sub-saharan Africa VA 1964 AGR 16.30154
#> 2 BWA SSA Sub-saharan Africa VA 1965 AGR 15.72700
#> 3 BWA SSA Sub-saharan Africa VA 1966 AGR 17.68066
#> 4 BWA SSA Sub-saharan Africa VA 1967 AGR 19.14591
#> 5 BWA SSA Sub-saharan Africa VA 1968 AGR 21.09957
#> 6 BWA SSA Sub-saharan Africa VA 1969 AGR 21.86221
# Melting including saving labels
pivot(GGDC10S, 1:5, na.rm = TRUE, labels = TRUE) |> head()
#> Country Regioncode Region Variable Year variable label
#> 1 BWA SSA Sub-saharan Africa VA 1964 AGR Agriculture
#> 2 BWA SSA Sub-saharan Africa VA 1965 AGR Agriculture
#> 3 BWA SSA Sub-saharan Africa VA 1966 AGR Agriculture
#> 4 BWA SSA Sub-saharan Africa VA 1967 AGR Agriculture
#> 5 BWA SSA Sub-saharan Africa VA 1968 AGR Agriculture
#> 6 BWA SSA Sub-saharan Africa VA 1969 AGR Agriculture
#> value
#> 1 16.30154
#> 2 15.72700
#> 3 17.68066
#> 4 19.14591
#> 5 21.09957
#> 6 21.86221
pivot(GGDC10S, 1:5, na.rm = TRUE, labels = "description") |> head()
#> Country Regioncode Region Variable Year variable description
#> 1 BWA SSA Sub-saharan Africa VA 1964 AGR Agriculture
#> 2 BWA SSA Sub-saharan Africa VA 1965 AGR Agriculture
#> 3 BWA SSA Sub-saharan Africa VA 1966 AGR Agriculture
#> 4 BWA SSA Sub-saharan Africa VA 1967 AGR Agriculture
#> 5 BWA SSA Sub-saharan Africa VA 1968 AGR Agriculture
#> 6 BWA SSA Sub-saharan Africa VA 1969 AGR Agriculture
#> value
#> 1 16.30154
#> 2 15.72700
#> 3 17.68066
#> 4 19.14591
#> 5 21.09957
#> 6 21.86221
# Also assigning new labels
pivot(GGDC10S, 1:5, na.rm = TRUE, labels = list("description",
c("Sector Code", "Sector Description", "Value"))) |> namlab()
#> Variable Label
#> 1 Country Country
#> 2 Regioncode Region code
#> 3 Region Region
#> 4 Variable Variable
#> 5 Year Year
#> 6 variable Sector Code
#> 7 description Sector Description
#> 8 value Value
# Can leave out value column by providing named vector of labels
pivot(GGDC10S, 1:5, na.rm = TRUE, labels = list("description",
c(variable = "Sector Code", description = "Sector Description"))) |> namlab()
#> Variable Label
#> 1 Country Country
#> 2 Regioncode Region code
#> 3 Region Region
#> 4 Variable Variable
#> 5 Year Year
#> 6 variable Sector Code
#> 7 description Sector Description
#> 8 value <NA>
# Now here is a nice example that is explicit and respects the dataset naming conventions
pivot(GGDC10S, ids = 1:5, na.rm = TRUE,
names = list(variable = "Sectorcode",
value = "Value"),
labels = list(name = "Sector",
new = c(Sectorcode = "GGDC10S Sector Code",
Sector = "Long Sector Description",
Value = "Employment or Value Added"))) |>
namlab(N = TRUE, Nd = TRUE, class = TRUE)
#> Variable Class N Ndist Label
#> 1 Country character 46942 43 Country
#> 2 Regioncode character 46942 6 Region code
#> 3 Region character 46942 6 Region
#> 4 Variable character 46942 2 Variable
#> 5 Year numeric 46942 67 Year
#> 6 Sectorcode factor 46942 11 GGDC10S Sector Code
#> 7 Sector factor 46942 11 Long Sector Description
#> 8 Value numeric 46942 46478 Employment or Value Added
# Note that pivot() currently does not support melting to multiple columns
# But you can tackle the issue with a bit of programming:
wide <- pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable",
how = "wider", na.rm = TRUE)
head(wide)
#> Country Year AGR_VA AGR_EMP MAN_VA MAN_EMP SUM_VA SUM_EMP
#> 1 BWA 1964 16.30154 152.1179 0.7365696 2.420000 37.48229 173.8829
#> 2 BWA 1965 15.72700 153.2971 1.0181992 2.330406 39.34710 178.7637
#> 3 BWA 1966 17.68066 153.8867 0.8038415 1.281642 43.14677 179.7183
#> 4 BWA 1967 19.14591 155.0659 0.9378151 1.041623 41.39519 178.9181
#> 5 BWA 1968 21.09957 156.2451 0.7502521 1.069332 41.14259 181.9292
#> 6 BWA 1969 21.86221 157.4243 2.1396077 2.124402 51.22160 188.0569
library(magrittr)
wide %>% {av(pivot(., 1:2, grep("_VA", names(.))), pivot(gvr(., "_EMP")))} |> head()
#> Country Year variable value variable value
#> 1 BWA 1964 AGR_VA 16.30154 AGR_EMP 152.1179
#> 2 BWA 1965 AGR_VA 15.72700 AGR_EMP 153.2971
#> 3 BWA 1966 AGR_VA 17.68066 AGR_EMP 153.8867
#> 4 BWA 1967 AGR_VA 19.14591 AGR_EMP 155.0659
#> 5 BWA 1968 AGR_VA 21.09957 AGR_EMP 156.2451
#> 6 BWA 1969 AGR_VA 21.86221 AGR_EMP 157.4243
wide %>% {av(av(gv(., 1:2), rm_stub(gvr(., "_VA"), "_VA", pre = FALSE)) |>
pivot(1:2, names = list("Sectorcode", "VA"), labels = "Sector"),
EMP = vec(gvr(., "_EMP")))} |> head()
#> Country Year Sectorcode Sector VA EMP
#> 1 BWA 1964 AGR Agriculture 16.30154 152.1179
#> 2 BWA 1965 AGR Agriculture 15.72700 153.2971
#> 3 BWA 1966 AGR Agriculture 17.68066 153.8867
#> 4 BWA 1967 AGR Agriculture 19.14591 155.0659
#> 5 BWA 1968 AGR Agriculture 21.09957 156.2451
#> 6 BWA 1969 AGR Agriculture 21.86221 157.4243
rm(wide)
# -------------------------------- PIVOT WIDER ---------------------------------
iris_long <- pivot(iris, "Species") # Getting a long frame
head(iris_long)
#> Species variable value
#> 1 setosa Sepal.Length 5.1
#> 2 setosa Sepal.Length 4.9
#> 3 setosa Sepal.Length 4.7
#> 4 setosa Sepal.Length 4.6
#> 5 setosa Sepal.Length 5.0
#> 6 setosa Sepal.Length 5.4
# If 'names'/'values' not supplied, searches for 'variable' and 'value' columns
pivot(iris_long, how = "wider")
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 setosa 5.0 3.3 1.4 0.2
#> 2 versicolor 5.7 2.8 4.1 1.3
#> 3 virginica 5.9 3.0 5.1 1.8
# But here the records are not identified by 'Species': thus aggregation with last value:
pivot(iris_long, how = "wider", check = TRUE) # issues a warning
#> Warning: duplicates detected: there are 12 unique combinations of id- and name-columns, but the data has 600 rows. This means you have on average 50 duplicates per id-name-combination. If how = 'wider', pivot() will take the last of those duplicates in first-appearance-order. Consider aggregating your data e.g. using collap() before applying pivot().
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 setosa 5.0 3.3 1.4 0.2
#> 2 versicolor 5.7 2.8 4.1 1.3
#> 3 virginica 5.9 3.0 5.1 1.8
rm(iris_long)
# This works better, these two are inverse operations
wlddev |> pivot(1:8) |> pivot(how = "w") |> head()
#> country iso3c date year decade region income OECD PCGDP
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA
#> 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA
#> 5 Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA
#> LIFEEX GINI ODA POP
#> 1 32.446 NA 116769997 8996973
#> 2 32.962 NA 232080002 9169410
#> 3 33.471 NA 112839996 9351441
#> 4 33.971 NA 237720001 9543205
#> 5 34.463 NA 295920013 9744781
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
# ...but not perfect, we loose labels
namlab(wlddev)
#> Variable
#> 1 country
#> 2 iso3c
#> 3 date
#> 4 year
#> 5 decade
#> 6 region
#> 7 income
#> 8 OECD
#> 9 PCGDP
#> 10 LIFEEX
#> 11 GINI
#> 12 ODA
#> 13 POP
#> Label
#> 1 Country Name
#> 2 Country Code
#> 3 Date Recorded (Fictitious)
#> 4 Year
#> 5 Decade
#> 6 Region
#> 7 Income Level
#> 8 Is OECD Member Country?
#> 9 GDP per capita (constant 2010 US$)
#> 10 Life expectancy at birth, total (years)
#> 11 Gini index (World Bank estimate)
#> 12 Net official development assistance and official aid received (constant 2018 US$)
#> 13 Population, total
wlddev |> pivot(1:8) |> pivot(how = "w") |> namlab()
#> Variable Label
#> 1 country Country Name
#> 2 iso3c Country Code
#> 3 date Date Recorded (Fictitious)
#> 4 year Year
#> 5 decade Decade
#> 6 region Region
#> 7 income Income Level
#> 8 OECD Is OECD Member Country?
#> 9 PCGDP <NA>
#> 10 LIFEEX <NA>
#> 11 GINI <NA>
#> 12 ODA <NA>
#> 13 POP <NA>
# But pivot() supports labels: these are perfect inverse operations
wlddev |> pivot(1:8, labels = "label") |> print(max = 50) |> # Notice the "label" column
pivot(how = "w", labels = "label") |> namlab()
#> country iso3c date year decade region income OECD variable
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE PCGDP
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE PCGDP
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE PCGDP
#> 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE PCGDP
#> label value
#> 1 GDP per capita (constant 2010 US$) NA
#> 2 GDP per capita (constant 2010 US$) NA
#> 3 GDP per capita (constant 2010 US$) NA
#> 4 GDP per capita (constant 2010 US$) NA
#> [ reached 'max' / getOption("max.print") -- omitted 65876 rows ]
#> Variable
#> 1 country
#> 2 iso3c
#> 3 date
#> 4 year
#> 5 decade
#> 6 region
#> 7 income
#> 8 OECD
#> 9 PCGDP
#> 10 LIFEEX
#> 11 GINI
#> 12 ODA
#> 13 POP
#> Label
#> 1 Country Name
#> 2 Country Code
#> 3 Date Recorded (Fictitious)
#> 4 Year
#> 5 Decade
#> 6 Region
#> 7 Income Level
#> 8 Is OECD Member Country?
#> 9 GDP per capita (constant 2010 US$)
#> 10 Life expectancy at birth, total (years)
#> 11 Gini index (World Bank estimate)
#> 12 Net official development assistance and official aid received (constant 2018 US$)
#> 13 Population, total
# If the data does not have 'variable'/'value' cols: need to specify 'names'/'values'
# Using a single column:
pivot(GGDC10S, c("Country", "Year"), "SUM", "Variable", how = "w") |> head()
#> Country Year VA EMP
#> 1 BWA 1960 NA NA
#> 2 BWA 1961 NA NA
#> 3 BWA 1962 NA NA
#> 4 BWA 1963 NA NA
#> 5 BWA 1964 37.48229 173.8829
#> 6 BWA 1965 39.34710 178.7637
SUM_wide <- pivot(GGDC10S, c("Country", "Year"), "SUM", "Variable", how = "w", na.rm = TRUE)
head(SUM_wide) # na.rm = TRUE here removes all new rows completely missing data
#> Country Year VA EMP
#> 1 BWA 1964 37.48229 173.8829
#> 2 BWA 1965 39.34710 178.7637
#> 3 BWA 1966 43.14677 179.7183
#> 4 BWA 1967 41.39519 178.9181
#> 5 BWA 1968 41.14259 181.9292
#> 6 BWA 1969 51.22160 188.0569
tail(SUM_wide) # But there may still be NA's, notice the NA in the final row
#> Country Year VA EMP
#> 2341 EGY 2008 844222.3 21039.90
#> 2342 EGY 2009 978684.0 21863.86
#> 2343 EGY 2010 1133629.1 22019.88
#> 2344 EGY 2011 1290896.1 22219.39
#> 2345 EGY 2012 1487175.1 22532.56
#> 2346 EGY 2013 1650962.8 NA
# We could use fill to set another value
pivot(GGDC10S, c("Country", "Year"), "SUM", "Variable", how = "w",
na.rm = TRUE, fill = -9999) |> tail()
#> Country Year VA EMP
#> 2341 EGY 2008 844222.3 21039.90
#> 2342 EGY 2009 978684.0 21863.86
#> 2343 EGY 2010 1133629.1 22019.88
#> 2344 EGY 2011 1290896.1 22219.39
#> 2345 EGY 2012 1487175.1 22532.56
#> 2346 EGY 2013 1650962.8 -9999.00
# This will keep the label of "SUM", unless we supply a column with new labels
namlab(SUM_wide)
#> Variable Label
#> 1 Country Country
#> 2 Year Year
#> 3 VA Summation of sector GDP
#> 4 EMP Summation of sector GDP
# Such a column is not available here, but we could use "Variable" twice
pivot(GGDC10S, c("Country", "Year"), "SUM", "Variable", "Variable", how = "w",
na.rm = TRUE) |> namlab()
#> Variable Label
#> 1 Country Country
#> 2 Year Year
#> 3 VA VA
#> 4 EMP EMP
# Alternatively, can of course relabel ex-post
SUM_wide |> relabel(VA = "Value Added", EMP = "Employment") |> namlab()
#> Variable Label
#> 1 Country Country
#> 2 Year Year
#> 3 VA Value Added
#> 4 EMP Employment
rm(SUM_wide)
# Multiple-column pivots
pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable", how = "w",
na.rm = TRUE) |> head()
#> Country Year AGR_VA AGR_EMP MAN_VA MAN_EMP SUM_VA SUM_EMP
#> 1 BWA 1964 16.30154 152.1179 0.7365696 2.420000 37.48229 173.8829
#> 2 BWA 1965 15.72700 153.2971 1.0181992 2.330406 39.34710 178.7637
#> 3 BWA 1966 17.68066 153.8867 0.8038415 1.281642 43.14677 179.7183
#> 4 BWA 1967 19.14591 155.0659 0.9378151 1.041623 41.39519 178.9181
#> 5 BWA 1968 21.09957 156.2451 0.7502521 1.069332 41.14259 181.9292
#> 6 BWA 1969 21.86221 157.4243 2.1396077 2.124402 51.22160 188.0569
# Here we may prefer a transposed column order
pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable", how = "w",
na.rm = TRUE, transpose = "columns") |> head()
#> Country Year AGR_VA MAN_VA SUM_VA AGR_EMP MAN_EMP SUM_EMP
#> 1 BWA 1964 16.30154 0.7365696 37.48229 152.1179 2.420000 173.8829
#> 2 BWA 1965 15.72700 1.0181992 39.34710 153.2971 2.330406 178.7637
#> 3 BWA 1966 17.68066 0.8038415 43.14677 153.8867 1.281642 179.7183
#> 4 BWA 1967 19.14591 0.9378151 41.39519 155.0659 1.041623 178.9181
#> 5 BWA 1968 21.09957 0.7502521 41.14259 156.2451 1.069332 181.9292
#> 6 BWA 1969 21.86221 2.1396077 51.22160 157.4243 2.124402 188.0569
# Can also flip the order of names (independently of columns)
pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable", how = "w",
na.rm = TRUE, transpose = "names") |> head()
#> Country Year VA_AGR EMP_AGR VA_MAN EMP_MAN VA_SUM EMP_SUM
#> 1 BWA 1964 16.30154 152.1179 0.7365696 2.420000 37.48229 173.8829
#> 2 BWA 1965 15.72700 153.2971 1.0181992 2.330406 39.34710 178.7637
#> 3 BWA 1966 17.68066 153.8867 0.8038415 1.281642 43.14677 179.7183
#> 4 BWA 1967 19.14591 155.0659 0.9378151 1.041623 41.39519 178.9181
#> 5 BWA 1968 21.09957 156.2451 0.7502521 1.069332 41.14259 181.9292
#> 6 BWA 1969 21.86221 157.4243 2.1396077 2.124402 51.22160 188.0569
# Can also enable both (complete transposition)
pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable", how = "w",
na.rm = TRUE, transpose = TRUE) |> head() # or tranpose = c("columns", "names")
#> Country Year VA_AGR VA_MAN VA_SUM EMP_AGR EMP_MAN EMP_SUM
#> 1 BWA 1964 16.30154 0.7365696 37.48229 152.1179 2.420000 173.8829
#> 2 BWA 1965 15.72700 1.0181992 39.34710 153.2971 2.330406 178.7637
#> 3 BWA 1966 17.68066 0.8038415 43.14677 153.8867 1.281642 179.7183
#> 4 BWA 1967 19.14591 0.9378151 41.39519 155.0659 1.041623 178.9181
#> 5 BWA 1968 21.09957 0.7502521 41.14259 156.2451 1.069332 181.9292
#> 6 BWA 1969 21.86221 2.1396077 51.22160 157.4243 2.124402 188.0569
# Finally, here is a nice, simple way to reshape the entire dataset.
pivot(GGDC10S, values = 6:16, names = "Variable", na.rm = TRUE, how = "w") |>
namlab(N = TRUE, Nd = TRUE, class = TRUE)
#> Variable Class N Ndist Label
#> 1 Country character 2346 43 Country
#> 2 Regioncode character 2346 6 Region code
#> 3 Region character 2346 6 Region
#> 4 Year numeric 2346 67 Year
#> 5 AGR_VA numeric 2139 2135 Agriculture
#> 6 AGR_EMP numeric 2225 2219 Agriculture
#> 7 MIN_VA numeric 2139 2072 Mining
#> 8 MIN_EMP numeric 2216 2153 Mining
#> 9 MAN_VA numeric 2139 2139 Manufacturing
#> 10 MAN_EMP numeric 2216 2214 Manufacturing
#> 11 PU_VA numeric 2139 2097 Utilities
#> 12 PU_EMP numeric 2215 2141 Utilities
#> 13 CON_VA numeric 2139 2130 Construction
#> 14 CON_EMP numeric 2216 2209 Construction
#> [ reached 'max' / getOption("max.print") -- omitted 12 rows ]
# -------------------------------- PIVOT RECAST ---------------------------------
# Look at the data again
head(GGDC10S)
#> Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT
#> 1 BWA SSA Sub-saharan Africa VA 1960 NA NA NA NA NA NA
#> 2 BWA SSA Sub-saharan Africa VA 1961 NA NA NA NA NA NA
#> 3 BWA SSA Sub-saharan Africa VA 1962 NA NA NA NA NA NA
#> 4 BWA SSA Sub-saharan Africa VA 1963 NA NA NA NA NA NA
#> TRA FIRE GOV OTH SUM
#> 1 NA NA NA NA NA
#> 2 NA NA NA NA NA
#> 3 NA NA NA NA NA
#> 4 NA NA NA NA NA
#> [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
# Let's stack the sectors and instead create variable columns
pivot(GGDC10S, .c(Country, Regioncode, Region, Year),
names = list("Variable", "Sectorcode"), how = "r") |> head()
#> Country Regioncode Region Year Sectorcode VA EMP
#> 1 BWA SSA Sub-saharan Africa 1960 AGR NA NA
#> 2 BWA SSA Sub-saharan Africa 1961 AGR NA NA
#> 3 BWA SSA Sub-saharan Africa 1962 AGR NA NA
#> 4 BWA SSA Sub-saharan Africa 1963 AGR NA NA
#> 5 BWA SSA Sub-saharan Africa 1964 AGR 16.30154 152.1179
#> 6 BWA SSA Sub-saharan Africa 1965 AGR 15.72700 153.2971
# Same thing (a bit easier)
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"), how = "r") |> head()
#> Country Regioncode Region Year Sectorcode VA EMP
#> 1 BWA SSA Sub-saharan Africa 1960 AGR NA NA
#> 2 BWA SSA Sub-saharan Africa 1961 AGR NA NA
#> 3 BWA SSA Sub-saharan Africa 1962 AGR NA NA
#> 4 BWA SSA Sub-saharan Africa 1963 AGR NA NA
#> 5 BWA SSA Sub-saharan Africa 1964 AGR 16.30154 152.1179
#> 6 BWA SSA Sub-saharan Africa 1965 AGR 15.72700 153.2971
# Removing missing values
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"), how = "r",
na.rm = TRUE) |> head()
#> Country Regioncode Region Year Sectorcode VA EMP
#> 1 BWA SSA Sub-saharan Africa 1960 AGR NA NA
#> 2 BWA SSA Sub-saharan Africa 1961 AGR NA NA
#> 3 BWA SSA Sub-saharan Africa 1962 AGR NA NA
#> 4 BWA SSA Sub-saharan Africa 1963 AGR NA NA
#> 5 BWA SSA Sub-saharan Africa 1964 AGR 16.30154 152.1179
#> 6 BWA SSA Sub-saharan Africa 1965 AGR 15.72700 153.2971
# Saving Labels
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"),
labels = list(to = "Sector"), how = "r", na.rm = TRUE) |> head()
#> Country Regioncode Region Year Sectorcode Sector VA
#> 1 BWA SSA Sub-saharan Africa 1960 AGR Agriculture NA
#> 2 BWA SSA Sub-saharan Africa 1961 AGR Agriculture NA
#> 3 BWA SSA Sub-saharan Africa 1962 AGR Agriculture NA
#> 4 BWA SSA Sub-saharan Africa 1963 AGR Agriculture NA
#> 5 BWA SSA Sub-saharan Africa 1964 AGR Agriculture 16.30154
#> 6 BWA SSA Sub-saharan Africa 1965 AGR Agriculture 15.72700
#> EMP
#> 1 NA
#> 2 NA
#> 3 NA
#> 4 NA
#> 5 152.1179
#> 6 153.2971
# Supplying new labels for generated columns: as complete as it gets
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"),
labels = list(to = "Sector",
new = c(Sectorcode = "GGDC10S Sector Code",
Sector = "Long Sector Description",
VA = "Value Added",
EMP = "Employment")), how = "r", na.rm = TRUE) |>
namlab(N = TRUE, Nd = TRUE, class = TRUE)
#> Variable Class N Ndist Label
#> 1 Country character 27852 43 Country
#> 2 Regioncode character 27852 6 Region code
#> 3 Region character 27852 6 Region
#> 4 Year numeric 27852 67 Year
#> 5 Sectorcode factor 27852 11 GGDC10S Sector Code
#> 6 Sector factor 27852 11 Long Sector Description
#> 7 VA numeric 23092 22915 Value Added
#> 8 EMP numeric 23850 23610 Employment
# Now another (slightly unconventional) use case here is data transposition
# Let's get the data for Botswana
BWA <- GGDC10S |> fsubset(Country == "BWA", Variable, Year, AGR:SUM)
head(BWA)
#> Variable Year AGR MIN MAN PU CON WRT
#> 1 VA 1960 NA NA NA NA NA NA
#> 2 VA 1961 NA NA NA NA NA NA
#> 3 VA 1962 NA NA NA NA NA NA
#> 4 VA 1963 NA NA NA NA NA NA
#> 5 VA 1964 16.30154 3.494075 0.7365696 0.1043936 0.6600454 6.243732
#> TRA FIRE GOV OTH SUM
#> 1 NA NA NA NA NA
#> 2 NA NA NA NA NA
#> 3 NA NA NA NA NA
#> 4 NA NA NA NA NA
#> 5 1.658928 1.119194 4.822485 2.341328 37.48229
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
# By supplying no ids or values, we are simply requesting a transpose operation
pivot(BWA, names = list(from = c("Variable", "Year"), to = "Sectorcode"), how = "r")
#> Sectorcode VA_1960 VA_1961 VA_1962 VA_1963 VA_1964 VA_1965 VA_1966 VA_1967
#> VA_1968 VA_1969 VA_1970 VA_1971 VA_1972 VA_1973 VA_1974 VA_1975 VA_1976
#> VA_1977 VA_1978 VA_1979 VA_1980 VA_1981 VA_1982 VA_1983 VA_1984 VA_1985
#> VA_1986 VA_1987 VA_1988 VA_1989 VA_1990 VA_1991 VA_1992 VA_1993 VA_1994
#> VA_1995 VA_1996 VA_1997 VA_1998 VA_1999 VA_2000 VA_2001 VA_2002 VA_2003
#> VA_2004 VA_2005 VA_2006 VA_2007 VA_2008 VA_2009 VA_2010 VA_2011 EMP_1960
#> EMP_1961 EMP_1962 EMP_1963 EMP_1964 EMP_1965 EMP_1966 EMP_1967 EMP_1968
#> EMP_1969 EMP_1970 EMP_1971 EMP_1972 EMP_1973 EMP_1974 EMP_1975 EMP_1976
#> EMP_1977 EMP_1978 EMP_1979 EMP_1980 EMP_1981 EMP_1982 EMP_1983 EMP_1984
#> EMP_1985 EMP_1986 EMP_1987 EMP_1988 EMP_1989 EMP_1990 EMP_1991 EMP_1992
#> EMP_1993 EMP_1994 EMP_1995 EMP_1996 EMP_1997 EMP_1998 EMP_1999 EMP_2000
#> EMP_2001 EMP_2002 EMP_2003 EMP_2004 EMP_2005 EMP_2006 EMP_2007 EMP_2008
#> EMP_2009 EMP_2010 EMP_2011
#> [ reached 'max' / getOption("max.print") -- omitted 11 rows ]
# Same with labels
pivot(BWA, names = list(from = c("Variable", "Year"), to = "Sectorcode"),
labels = list(to = "Sector"), how = "r")
#> Sectorcode Sector VA_1960 VA_1961 VA_1962 VA_1963 VA_1964 VA_1965 VA_1966
#> VA_1967 VA_1968 VA_1969 VA_1970 VA_1971 VA_1972 VA_1973 VA_1974 VA_1975
#> VA_1976 VA_1977 VA_1978 VA_1979 VA_1980 VA_1981 VA_1982 VA_1983 VA_1984
#> VA_1985 VA_1986 VA_1987 VA_1988 VA_1989 VA_1990 VA_1991 VA_1992 VA_1993
#> VA_1994 VA_1995 VA_1996 VA_1997 VA_1998 VA_1999 VA_2000 VA_2001 VA_2002
#> VA_2003 VA_2004 VA_2005 VA_2006 VA_2007 VA_2008 VA_2009 VA_2010 VA_2011
#> EMP_1960 EMP_1961 EMP_1962 EMP_1963 EMP_1964 EMP_1965 EMP_1966 EMP_1967
#> EMP_1968 EMP_1969 EMP_1970 EMP_1971 EMP_1972 EMP_1973 EMP_1974 EMP_1975
#> EMP_1976 EMP_1977 EMP_1978 EMP_1979 EMP_1980 EMP_1981 EMP_1982 EMP_1983
#> EMP_1984 EMP_1985 EMP_1986 EMP_1987 EMP_1988 EMP_1989 EMP_1990 EMP_1991
#> EMP_1992 EMP_1993 EMP_1994 EMP_1995 EMP_1996 EMP_1997 EMP_1998 EMP_1999
#> EMP_2000 EMP_2001 EMP_2002 EMP_2003 EMP_2004 EMP_2005 EMP_2006 EMP_2007
#> EMP_2008 EMP_2009 EMP_2010 EMP_2011
#> [ reached 'max' / getOption("max.print") -- omitted 11 rows ]
# For simple cases, data.table::transpose() will be more efficient, but with multiple
# columns to generate names and/or variable labels to be saved/assigned, pivot() is handy
rm(BWA)