Skip to contents

This short vignette focuses on using collapse with the popular sf package by Edzer Pebesma. It shows that collapse supports easy manipulation of sf data frames, at computation speeds far above dplyr.

collapse is class-agnostic and can theoretically be used to manipulate any kind of list, matrix or data frame based object. Previous versions of collapse could manipulate sf data frames, but required users to manually include the geometry column when using functions like in fselect and fsubset, and exclude it in other functions like qsu.

Version 1.6.0 of the package adds internal support for sf data frames where the geometry column is handled by most essential functions (fselect, get_vars, num_vars, fsubset, ss, fgroup_by, qsu, descr, varying, funique, roworder, rsplit, fcompute, …). This largely does not affect previously written code where the geometry column is explicitly selected. To demonstrate these features, we can load a test dataset provided by sf:

library(collapse)
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
options(sf_max_print = 3)
nc
# Simple feature collection with 100 features and 14 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA PERIMETER CNTY_ CNTY_ID      NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# 1 0.114     1.442  1825    1825      Ashe 37009  37009        5  1091     1      10  1364     0
# 2 0.061     1.231  1827    1827 Alleghany 37005  37005        3   487     0      10   542     3
# 3 0.143     1.630  1828    1828     Surry 37171  37171       86  3188     5     208  3616     6
#   NWBIR79                       geometry
# 1      19 MULTIPOLYGON (((-81.47276 3...
# 2      12 MULTIPOLYGON (((-81.23989 3...
# 3     260 MULTIPOLYGON (((-80.45634 3...

Summarising sf Data Frames

The first thing that collapse 1.6.0 adds is summary statistics for sf data frames that automatically exclude the ‘geometry’ column:

# Which columns have at least 2 non-missing distinct values
varying(nc) 
#      AREA PERIMETER     CNTY_   CNTY_ID      NAME      FIPS    FIPSNO  CRESS_ID     BIR74     SID74 
#      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE 
#   NWBIR74     BIR79     SID79   NWBIR79 
#      TRUE      TRUE      TRUE      TRUE

# Quick summary stats
qsu(nc)
#              N     Mean         SD    Min    Max
# AREA       100   0.1263     0.0492  0.042  0.241
# PERIMETER  100    1.673     0.4823  0.999   3.64
# CNTY_      100  1985.96   106.5166   1825   2241
# CNTY_ID    100  1985.96   106.5166   1825   2241
# NAME       100        -          -      -      -
# FIPS       100        -          -      -      -
# FIPSNO     100    37100     58.023  37001  37199
# CRESS_ID   100     50.5    29.0115      1    100
# BIR74      100  3299.62  3848.1651    248  21588
# SID74      100     6.67     7.7812      0     44
# NWBIR74    100  1050.81  1432.9117      1   8027
# BIR79      100  4223.92  5179.4582    319  30757
# SID79      100     8.36     9.4319      0     57
# NWBIR79    100  1352.81  1975.9988      3  11631

# Detailed statistics description of each column
descr(nc)
# Dataset: nc, 14 Variables, N = 100
# ----------------------------------------------------------------------------------------------------
# AREA (numeric): 
# Statistics
#     N  Ndist  Mean    SD   Min   Max  Skew  Kurt
#   100     77  0.13  0.05  0.04  0.24  0.48   2.5
# Quantiles
#     1%    5%   10%   25%   50%   75%  90%   95%   99%
#   0.04  0.06  0.06  0.09  0.12  0.15  0.2  0.21  0.24
# ----------------------------------------------------------------------------------------------------
# PERIMETER (numeric): 
# Statistics
#     N  Ndist  Mean    SD  Min   Max  Skew  Kurt
#   100     96  1.67  0.48    1  3.64  1.48  5.95
# Quantiles
#   1%    5%   10%   25%   50%   75%  90%   95%  99%
#    1  1.09  1.19  1.32  1.61  1.86  2.2  2.72  3.2
# ----------------------------------------------------------------------------------------------------
# CNTY_ (numeric): 
# Statistics
#     N  Ndist     Mean      SD   Min   Max  Skew  Kurt
#   100    100  1985.96  106.52  1825  2241  0.26  2.32
# Quantiles
#        1%       5%     10%      25%   50%      75%   90%     95%      99%
#   1826.98  1832.95  1837.9  1902.25  1982  2067.25  2110  2156.3  2238.03
# ----------------------------------------------------------------------------------------------------
# CNTY_ID (numeric): 
# Statistics
#     N  Ndist     Mean      SD   Min   Max  Skew  Kurt
#   100    100  1985.96  106.52  1825  2241  0.26  2.32
# Quantiles
#        1%       5%     10%      25%   50%      75%   90%     95%      99%
#   1826.98  1832.95  1837.9  1902.25  1982  2067.25  2110  2156.3  2238.03
# ----------------------------------------------------------------------------------------------------
# NAME (character): 
# Statistics
#     N  Ndist
#   100    100
# Table
#                Freq  Perc
# Ashe              1     1
# Alleghany         1     1
# Surry             1     1
# Currituck         1     1
# Northampton       1     1
# Hertford          1     1
# Camden            1     1
# Gates             1     1
# Warren            1     1
# Stokes            1     1
# Caswell           1     1
# Rockingham        1     1
# Granville         1     1
# Person            1     1
# ... 86 Others    86    86
# 
# Summary of Table Frequencies
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#       1       1       1       1       1       1 
# ----------------------------------------------------------------------------------------------------
# FIPS (character): 
# Statistics
#     N  Ndist
#   100    100
# Table
#                Freq  Perc
# 37009             1     1
# 37005             1     1
# 37171             1     1
# 37053             1     1
# 37131             1     1
# 37091             1     1
# 37029             1     1
# 37073             1     1
# 37185             1     1
# 37169             1     1
# 37033             1     1
# 37157             1     1
# 37077             1     1
# 37145             1     1
# ... 86 Others    86    86
# 
# Summary of Table Frequencies
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#       1       1       1       1       1       1 
# ----------------------------------------------------------------------------------------------------
# FIPSNO (numeric): 
# Statistics
#     N  Ndist   Mean     SD    Min    Max  Skew  Kurt
#   100    100  37100  58.02  37001  37199    -0   1.8
# Quantiles
#         1%       5%      10%      25%    50%      75%      90%      95%       99%
#   37002.98  37010.9  37020.8  37050.5  37100  37149.5  37179.2  37189.1  37197.02
# ----------------------------------------------------------------------------------------------------
# CRESS_ID (integer): 
# Statistics
#     N  Ndist  Mean     SD  Min  Max  Skew  Kurt
#   100    100  50.5  29.01    1  100     0   1.8
# Quantiles
#     1%    5%   10%    25%   50%    75%   90%    95%    99%
#   1.99  5.95  10.9  25.75  50.5  75.25  90.1  95.05  99.01
# ----------------------------------------------------------------------------------------------------
# BIR74 (numeric): 
# Statistics
#     N  Ndist     Mean       SD  Min    Max  Skew   Kurt
#   100    100  3299.62  3848.17  248  21588  2.79  11.79
# Quantiles
#       1%      5%    10%   25%     50%   75%     90%    95%       99%
#   283.64  419.75  531.8  1077  2180.5  3936  6725.7  11193  20378.22
# ----------------------------------------------------------------------------------------------------
# SID74 (numeric): 
# Statistics
#     N  Ndist  Mean    SD  Min  Max  Skew   Kurt
#   100     23  6.67  7.78    0   44  2.44  10.28
# Quantiles
#   1%  5%  10%  25%  50%   75%   90%    95%    99%
#    0   0    0    2    4  8.25  15.1  18.25  38.06
# ----------------------------------------------------------------------------------------------------
# NWBIR74 (numeric): 
# Statistics
#     N  Ndist     Mean       SD  Min   Max  Skew   Kurt
#   100     93  1050.81  1432.91    1  8027  2.83  11.84
# Quantiles
#   1%    5%   10%  25%    50%     75%     90%     95%      99%
#    1  9.95  39.2  190  697.5  1168.5  2231.8  3942.9  7052.84
# ----------------------------------------------------------------------------------------------------
# BIR79 (numeric): 
# Statistics
#     N  Ndist     Mean       SD  Min    Max  Skew  Kurt
#   100    100  4223.92  5179.46  319  30757  2.99  13.1
# Quantiles
#       1%     5%    10%      25%   50%   75%   90%       95%       99%
#   349.69  539.3  675.7  1336.25  2636  4889  8313  14707.45  26413.87
# ----------------------------------------------------------------------------------------------------
# SID79 (numeric): 
# Statistics
#     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#   100     28  8.36  9.43    0   57  2.28  9.88
# Quantiles
#   1%  5%  10%  25%  50%    75%  90%  95%    99%
#    0   0    1    2    5  10.25   21   26  38.19
# ----------------------------------------------------------------------------------------------------
# NWBIR79 (numeric): 
# Statistics
#     N  Ndist     Mean    SD  Min    Max  Skew   Kurt
#   100     98  1352.81  1976    3  11631  3.18  14.45
# Quantiles
#     1%    5%   10%    25%    50%      75%     90%     95%       99%
#   3.99  11.9  44.7  250.5  874.5  1406.75  2987.9  5090.5  10624.17
# ----------------------------------------------------------------------------------------------------

Selecting Columns and Subsetting

We can now also easily select columns from the sf data frame without having to worry about taking along ‘geometry’:

# Selecting a sequence of columns
fselect(nc, AREA, NAME:FIPSNO)
# Simple feature collection with 100 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA      NAME  FIPS FIPSNO                       geometry
# 1 0.114      Ashe 37009  37009 MULTIPOLYGON (((-81.47276 3...
# 2 0.061 Alleghany 37005  37005 MULTIPOLYGON (((-81.23989 3...
# 3 0.143     Surry 37171  37171 MULTIPOLYGON (((-80.45634 3...

# Same using standard evaluation (gv is a shorthand for get_vars())
gv(nc, c("AREA", "NAME", "FIPS", "FIPSNO"))
# Simple feature collection with 100 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA      NAME  FIPS FIPSNO                       geometry
# 1 0.114      Ashe 37009  37009 MULTIPOLYGON (((-81.47276 3...
# 2 0.061 Alleghany 37005  37005 MULTIPOLYGON (((-81.23989 3...
# 3 0.143     Surry 37171  37171 MULTIPOLYGON (((-80.45634 3...

The same applies to subsetting rows (and columns):

# A fast and enhanced version of base::subset, now also supporting sf data 
fsubset(nc, AREA > fmean(AREA), AREA, NAME:FIPSNO)
# Simple feature collection with 44 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA        NAME  FIPS FIPSNO                       geometry
# 1 0.143       Surry 37171  37171 MULTIPOLYGON (((-80.45634 3...
# 2 0.153 Northampton 37131  37131 MULTIPOLYGON (((-77.21767 3...
# 3 0.153  Rockingham 37157  37157 MULTIPOLYGON (((-79.53051 3...

# A fast version of `[` (where i is used and optionally j)
ss(nc, 1:10, c("AREA", "NAME", "FIPS", "FIPSNO"))
# Simple feature collection with 10 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA      NAME  FIPS FIPSNO                       geometry
# 1 0.114      Ashe 37009  37009 MULTIPOLYGON (((-81.47276 3...
# 2 0.061 Alleghany 37005  37005 MULTIPOLYGON (((-81.23989 3...
# 3 0.143     Surry 37171  37171 MULTIPOLYGON (((-80.45634 3...

This is significantly faster than using [, dplyr::select or dplyr::filter:

library(microbenchmark)
library(dplyr)

# Selecting columns
microbenchmark(collapse = fselect(nc, AREA, NAME:FIPSNO), 
               dplyr = select(nc, AREA, NAME:FIPSNO),
               collapse2 = gv(nc, c("AREA", "NAME", "FIPS", "FIPSNO")), 
               sf = nc[c("AREA", "NAME", "FIPS", "FIPSNO")])
# Unit: microseconds
#       expr      min        lq       mean    median        uq       max neval
#   collapse    7.564   13.5160   18.60561   19.3275   22.4470    50.190   100
#      dplyr 1268.180 1362.8425 1684.53575 1627.2525 1739.7145 11367.267   100
#  collapse2    6.146   11.5325   18.23224   17.0210   23.5485    54.845   100
#         sf  308.204  330.5045  397.05334  414.6825  431.7680   543.124   100
# Subsetting
microbenchmark(collapse = fsubset(nc, AREA > fmean(AREA), AREA, NAME:FIPSNO), 
               dplyr = select(nc, AREA, NAME:FIPSNO) %>% filter(AREA > fmean(AREA)),
               collapse2 = ss(nc, 1:10, c("AREA", "NAME", "FIPS", "FIPSNO")), 
               sf = nc[1:10, c("AREA", "NAME", "FIPS", "FIPSNO")])
# Unit: microseconds
#       expr      min        lq       mean    median        uq       max neval
#   collapse   24.518   30.8095   42.88230   44.9250   51.2155    70.610   100
#      dplyr 2390.549 2473.2795 2675.66844 2566.2565 2671.6470 10693.281   100
#  collapse2    6.605    8.5375   16.20885   16.1560   22.1540    61.942   100
#         sf  489.445  509.8915  575.05397  532.3785  549.2700  4366.781   100

It needs to be noted that at this point collapse functions don’t subset the ‘agr’ attribute on selecting columns, which (if specified) relates columns (attributes) to the geometry, and also don’t modify the ‘bbox’ attribute giving the overall boundaries of a set of geometries when subsetting the sf data frame. Keeping the full ‘agr’ attribute is not problematic for all practical purposes, but not changing ‘bbox’ upon subsetting may lead to too large margins when plotting the geometries of a subsetted sf data frame.

Aggregation and Grouping

The flexibility and speed of collap for aggregation can be used on sf data frames. A separate method for sf objects was not considered necessary as one can simply aggregate the geometry column using st_union:

# Aggregating by variable SID74 using the median for numeric and the mode for categorical columns
collap(nc, ~ SID74, custom = list(fmedian = is.numeric, 
                                  fmode = is.character, 
                                  st_union = "geometry")) # or use is.list to fetch the geometry
# Simple feature collection with 23 features and 15 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#     AREA PERIMETER  CNTY_ CNTY_ID      NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 SID74 NWBIR74  BIR79
# 1 0.0780    1.3070 1950.0  1950.0 Alleghany 37005  37073     37.0   487     0     0    40.0  594.0
# 2 0.0810    1.2880 1887.0  1887.0      Ashe 37009  37137     69.0   751     1     1   148.0  899.0
# 3 0.1225    1.6435 1959.5  1959.5   Caswell 37033  37078     39.5  1271     2     2   382.5 1676.5
#   SID79 NWBIR79                       geometry
# 1     1      45 MULTIPOLYGON (((-83.69563 3...
# 2     1     176 MULTIPOLYGON (((-80.02406 3...
# 3     2     452 MULTIPOLYGON (((-77.16129 3...

sf data frames can also be grouped and then aggregated using fsummarise:

nc %>% fgroup_by(SID74)
# Simple feature collection with 100 features and 14 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA PERIMETER CNTY_ CNTY_ID      NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# 1 0.114     1.442  1825    1825      Ashe 37009  37009        5  1091     1      10  1364     0
# 2 0.061     1.231  1827    1827 Alleghany 37005  37005        3   487     0      10   542     3
# 3 0.143     1.630  1828    1828     Surry 37171  37171       86  3188     5     208  3616     6
#   NWBIR79                       geometry
# 1      19 MULTIPOLYGON (((-81.47276 3...
# 2      12 MULTIPOLYGON (((-81.23989 3...
# 3     260 MULTIPOLYGON (((-80.45634 3...
# 
# Grouped by:  SID74  [23 | 4 (4) 1-13]

nc %>% fgroup_by(SID74) %>%
  fsummarise(AREA_Ag = fsum(AREA), 
             Perimeter_Ag = fmedian(PERIMETER),
             geometry = st_union(geometry))
# Simple feature collection with 23 features and 3 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#   SID74 AREA_Ag Perimeter_Ag                       geometry
# 1     0   1.103       1.3070 MULTIPOLYGON (((-83.69563 3...
# 2     1   0.914       1.2880 MULTIPOLYGON (((-80.02406 3...
# 3     2   1.047       1.6435 MULTIPOLYGON (((-77.16129 3...

It needs to be noted here that typically most of the time in aggregation is consumed by st_union so that the speed of collapse does not really become visible on most datasets. One exception is spatial panel data when aggregating over the time-dimension with identical geometries for each id over time. Such panels can quickly be aggregated using ffirst or flast to aggregate the geometry:

# Creating a panel-dataset by simply duplicating nc for 2 different years
pnc <- list(`2000` = nc, `2001` = nc) %>% unlist2d("Year") %>% copyMostAttrib(nc)
pnc 
# Simple feature collection with 200 features and 15 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#   Year  AREA PERIMETER CNTY_ CNTY_ID      NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79
# 1 2000 0.114     1.442  1825    1825      Ashe 37009  37009        5  1091     1      10  1364
# 2 2000 0.061     1.231  1827    1827 Alleghany 37005  37005        3   487     0      10   542
# 3 2000 0.143     1.630  1828    1828     Surry 37171  37171       86  3188     5     208  3616
#   SID79 NWBIR79                       geometry
# 1     0      19 MULTIPOLYGON (((-81.47276 3...
# 2     3      12 MULTIPOLYGON (((-81.23989 3...
# 3     6     260 MULTIPOLYGON (((-80.45634 3...

# Aggregating by NAME, using the last value for all categorical data
collap(pnc, ~ NAME, fmedian, catFUN = flast, cols = -1L)
# Simple feature collection with 100 features and 15 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA PERIMETER CNTY_ CNTY_ID      NAME      NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79
# 1 0.111     1.392  1904    1904  Alamance  Alamance 37001  37001        1  4672    13    1243  5767
# 2 0.066     1.070  1950    1950 Alexander Alexander 37003  37003        2  1333     0     128  1683
# 3 0.061     1.231  1827    1827 Alleghany Alleghany 37005  37005        3   487     0      10   542
#   SID79 NWBIR79                       geometry
# 1    11    1397 MULTIPOLYGON (((-79.24619 3...
# 2     2     150 MULTIPOLYGON (((-81.10889 3...
# 3     3      12 MULTIPOLYGON (((-81.23989 3...

# Using fsummarise to aggregate just two variables and the geometry
pnc_ag <- pnc %>% fgroup_by(NAME) %>%
  fsummarise(AREA_Ag = fsum(AREA), 
             Perimeter_Ag = fmedian(PERIMETER),
             geometry = flast(geometry))

# The geometry is still valid...
plot(slt(pnc_ag, AREA_Ag))

Unique Values, Ordering, Splitting, Binding

Functions funique and roworder(v) work on sf data frames and ignore the geometry column for determining the unique values / order of rows. rsplit can be used to (recursively) split an sf data frame into multiple chunks.

# Splitting by SID74
rsplit(nc, ~ SID74) %>% head(2)
# $`0`
# Simple feature collection with 13 features and 13 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA PERIMETER CNTY_ CNTY_ID      NAME  FIPS FIPSNO CRESS_ID BIR74 NWBIR74 BIR79 SID79 NWBIR79
# 1 0.061     1.231  1827    1827 Alleghany 37005  37005        3   487      10   542     3      12
# 2 0.062     1.547  1834    1834    Camden 37029  37029       15   286     115   350     2     139
# 3 0.091     1.284  1835    1835     Gates 37073  37073       37   420     254   594     2     371
#                         geometry
# 1 MULTIPOLYGON (((-81.23989 3...
# 2 MULTIPOLYGON (((-76.00897 3...
# 3 MULTIPOLYGON (((-76.56251 3...
# 
# $`1`
# Simple feature collection with 11 features and 13 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA PERIMETER CNTY_ CNTY_ID      NAME  FIPS FIPSNO CRESS_ID BIR74 NWBIR74 BIR79 SID79 NWBIR79
# 1 0.114     1.442  1825    1825      Ashe 37009  37009        5  1091      10  1364     0      19
# 2 0.070     2.968  1831    1831 Currituck 37053  37053       27   508     123   830     2     145
# 3 0.124     1.428  1837    1837    Stokes 37169  37169       85  1612     160  2038     5     176
#                         geometry
# 1 MULTIPOLYGON (((-81.47276 3...
# 2 MULTIPOLYGON (((-76.00897 3...
# 3 MULTIPOLYGON (((-80.02567 3...

It should be noted here that the default in rsplit for data frames is simplify = TRUE, which for a single LHS variable would just split the column-vector. This does not apply to sf data frames as the ‘geometry’ column is always selected as well.

# Only splitting Area
rsplit(nc, AREA ~ SID74) %>% head(1)
# $`0`
# Simple feature collection with 13 features and 1 field
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA                       geometry
# 1 0.061 MULTIPOLYGON (((-81.23989 3...
# 2 0.062 MULTIPOLYGON (((-76.00897 3...
# 3 0.091 MULTIPOLYGON (((-76.56251 3...

# For data frames the default simplify = TRUE drops the data frame structure
rsplit(qDF(nc), AREA ~ SID74) %>% head(1)
# $`0`
#  [1] 0.061 0.062 0.091 0.064 0.059 0.080 0.066 0.099 0.094 0.078 0.131 0.167 0.051

sf data frames can be combined using rowbind(), which will preserve the attributes of the first object.

# Splitting by each row and recombining
nc_combined <- nc %>% rsplit(seq_row(.)) %>% rowbind() 
identical(nc, nc_combined)
# [1] TRUE

Transformations

For transforming and computing columns, ftransform(v) and settransform(v) apply as to any other data frame.

ftransform(nc, scaled_AREA = fscale(AREA), 
               gsum_AREA = fsum(AREA, SID74, TRA = "replace_fill"))
# Simple feature collection with 100 features and 16 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA PERIMETER CNTY_ CNTY_ID      NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# 1 0.114     1.442  1825    1825      Ashe 37009  37009        5  1091     1      10  1364     0
# 2 0.061     1.231  1827    1827 Alleghany 37005  37005        3   487     0      10   542     3
# 3 0.143     1.630  1828    1828     Surry 37171  37171       86  3188     5     208  3616     6
#   NWBIR79                       geometry scaled_AREA gsum_AREA
# 1      19 MULTIPOLYGON (((-81.47276 3...  -0.2491860     0.914
# 2      12 MULTIPOLYGON (((-81.23989 3...  -1.3264176     1.103
# 3     260 MULTIPOLYGON (((-80.45634 3...   0.3402426     1.380

# settransform materializes the change, same as nc <- ftransform(nc, ...)
settransform(nc, scaled_AREA = fscale(AREA), 
                 gsum_AREA = fsum(AREA, SID74, TRA = "replace_fill"))
nc
# Simple feature collection with 100 features and 16 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA PERIMETER CNTY_ CNTY_ID      NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# 1 0.114     1.442  1825    1825      Ashe 37009  37009        5  1091     1      10  1364     0
# 2 0.061     1.231  1827    1827 Alleghany 37005  37005        3   487     0      10   542     3
# 3 0.143     1.630  1828    1828     Surry 37171  37171       86  3188     5     208  3616     6
#   NWBIR79                       geometry scaled_AREA gsum_AREA
# 1      19 MULTIPOLYGON (((-81.47276 3...  -0.2491860     0.914
# 2      12 MULTIPOLYGON (((-81.23989 3...  -1.3264176     1.103
# 3     260 MULTIPOLYGON (((-80.45634 3...   0.3402426     1.380

Special attention to sf data frames is afforded by fcompute, which can be used to compute new columns dropping existing ones - except for the geometry column and any columns selected through the keep argument.

fcompute(nc, scaled_AREA = fscale(AREA), 
             gsum_AREA = fsum(AREA, SID74, TRA = "replace_fill"), 
         keep = .c(AREA, SID74))
# Simple feature collection with 100 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#    AREA SID74 scaled_AREA gsum_AREA                       geometry
# 1 0.114     1  -0.2491860     0.914 MULTIPOLYGON (((-81.47276 3...
# 2 0.061     0  -1.3264176     1.103 MULTIPOLYGON (((-81.23989 3...
# 3 0.143     5   0.3402426     1.380 MULTIPOLYGON (((-80.45634 3...

Conversion to and from sf

The quick converters qDF, qDT and qTBL can be used to very efficiently convert sf data frames to standard data frames, data.table’s or tibbles, and the result can be converted back to the original sf data frame using setAttrib, copyAttrib or copyMostAttrib.

library(data.table)
qDT(nc)[, list(roll_AREA = frollmean(AREA, 2), geometry), by = SID74] %>% copyMostAttrib(nc)
# Simple feature collection with 100 features and 2 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS:  NAD27
# First 3 features:
#   SID74 roll_AREA                       geometry
# 1     1        NA MULTIPOLYGON (((-81.47276 3...
# 2     1     0.092 MULTIPOLYGON (((-76.00897 3...
# 3     1     0.097 MULTIPOLYGON (((-80.02567 3...

Conclusion

collapse provides no deep integration with the sf ecosystem, but sufficient features and flexibility to painlessly manipulate sf data frames with much greater performance.