collapse and sf
Fast Manipulation of Simple Features Data Frames
Sebastian Krantz and Grant McDermott
2024-04-19
Source:vignettes/collapse_and_sf.Rmd
collapse_and_sf.Rmd
This short vignette focuses on using collapse with the popular sf package by Edzer Pebesma. It shows that collapse supports easy manipulation of sf data frames, at computation speeds far above dplyr.
collapse v1.6.0 added internal support for sf data
frames by having most essential functions (e.g.,
fselect/gv
, fsubset/ss
,
fgroup_by
, findex_by
, qsu
,
descr
, varying
, funique
,
roworder
, rsplit
, fcompute
, …)
internally handle the geometry column.
To demonstrate this, we can load a test dataset provided by sf:
library(collapse)
library(sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
options(sf_max_print = 3)
nc
# Simple feature collection with 100 features and 14 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0
# 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3
# 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6
# NWBIR79 geometry
# 1 19 MULTIPOLYGON (((-81.47276 3...
# 2 12 MULTIPOLYGON (((-81.23989 3...
# 3 260 MULTIPOLYGON (((-80.45634 3...
Summarising sf Data Frames
Computing summary statistics on sf data frames automatically excludes the ‘geometry’ column:
# Which columns have at least 2 non-missing distinct values
varying(nc)
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
# TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# NWBIR74 BIR79 SID79 NWBIR79
# TRUE TRUE TRUE TRUE
# Quick summary stats
qsu(nc)
# N Mean SD Min Max
# AREA 100 0.1263 0.0492 0.042 0.241
# PERIMETER 100 1.673 0.4823 0.999 3.64
# CNTY_ 100 1985.96 106.5166 1825 2241
# CNTY_ID 100 1985.96 106.5166 1825 2241
# NAME 100 - - - -
# FIPS 100 - - - -
# FIPSNO 100 37100 58.023 37001 37199
# CRESS_ID 100 50.5 29.0115 1 100
# BIR74 100 3299.62 3848.1651 248 21588
# SID74 100 6.67 7.7812 0 44
# NWBIR74 100 1050.81 1432.9117 1 8027
# BIR79 100 4223.92 5179.4582 319 30757
# SID79 100 8.36 9.4319 0 57
# NWBIR79 100 1352.81 1975.9988 3 11631
# Detailed statistics description of each column
descr(nc)
# Dataset: nc, 14 Variables, N = 100
# ----------------------------------------------------------------------------------------------------
# AREA (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 77 0.13 0.05 0.04 0.24 0.48 2.5
# Quantiles
# 1% 5% 10% 25% 50% 75% 90% 95% 99%
# 0.04 0.06 0.06 0.09 0.12 0.15 0.2 0.21 0.24
# ----------------------------------------------------------------------------------------------------
# PERIMETER (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 96 1.67 0.48 1 3.64 1.48 5.95
# Quantiles
# 1% 5% 10% 25% 50% 75% 90% 95% 99%
# 1 1.09 1.19 1.32 1.61 1.86 2.2 2.72 3.2
# ----------------------------------------------------------------------------------------------------
# CNTY_ (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 100 1985.96 106.52 1825 2241 0.26 2.32
# Quantiles
# 1% 5% 10% 25% 50% 75% 90% 95% 99%
# 1826.98 1832.95 1837.9 1902.25 1982 2067.25 2110 2156.3 2238.03
# ----------------------------------------------------------------------------------------------------
# CNTY_ID (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 100 1985.96 106.52 1825 2241 0.26 2.32
# Quantiles
# 1% 5% 10% 25% 50% 75% 90% 95% 99%
# 1826.98 1832.95 1837.9 1902.25 1982 2067.25 2110 2156.3 2238.03
# ----------------------------------------------------------------------------------------------------
# NAME (character):
# Statistics
# N Ndist
# 100 100
# Table
# Freq Perc
# Ashe 1 1
# Alleghany 1 1
# Surry 1 1
# Currituck 1 1
# Northampton 1 1
# Hertford 1 1
# Camden 1 1
# Gates 1 1
# Warren 1 1
# Stokes 1 1
# Caswell 1 1
# Rockingham 1 1
# Granville 1 1
# Person 1 1
# ... 86 Others 86 86
#
# Summary of Table Frequencies
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 1 1 1 1 1
# ----------------------------------------------------------------------------------------------------
# FIPS (character):
# Statistics
# N Ndist
# 100 100
# Table
# Freq Perc
# 37009 1 1
# 37005 1 1
# 37171 1 1
# 37053 1 1
# 37131 1 1
# 37091 1 1
# 37029 1 1
# 37073 1 1
# 37185 1 1
# 37169 1 1
# 37033 1 1
# 37157 1 1
# 37077 1 1
# 37145 1 1
# ... 86 Others 86 86
#
# Summary of Table Frequencies
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 1 1 1 1 1
# ----------------------------------------------------------------------------------------------------
# FIPSNO (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 100 37100 58.02 37001 37199 -0 1.8
# Quantiles
# 1% 5% 10% 25% 50% 75% 90% 95% 99%
# 37002.98 37010.9 37020.8 37050.5 37100 37149.5 37179.2 37189.1 37197.02
# ----------------------------------------------------------------------------------------------------
# CRESS_ID (integer):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 100 50.5 29.01 1 100 0 1.8
# Quantiles
# 1% 5% 10% 25% 50% 75% 90% 95% 99%
# 1.99 5.95 10.9 25.75 50.5 75.25 90.1 95.05 99.01
# ----------------------------------------------------------------------------------------------------
# BIR74 (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 100 3299.62 3848.17 248 21588 2.79 11.79
# Quantiles
# 1% 5% 10% 25% 50% 75% 90% 95% 99%
# 283.64 419.75 531.8 1077 2180.5 3936 6725.7 11193 20378.22
# ----------------------------------------------------------------------------------------------------
# SID74 (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 23 6.67 7.78 0 44 2.44 10.28
# Quantiles
# 1% 5% 10% 25% 50% 75% 90% 95% 99%
# 0 0 0 2 4 8.25 15.1 18.25 38.06
# ----------------------------------------------------------------------------------------------------
# NWBIR74 (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 93 1050.81 1432.91 1 8027 2.83 11.84
# Quantiles
# 1% 5% 10% 25% 50% 75% 90% 95% 99%
# 1 9.95 39.2 190 697.5 1168.5 2231.8 3942.9 7052.84
# ----------------------------------------------------------------------------------------------------
# BIR79 (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 100 4223.92 5179.46 319 30757 2.99 13.1
# Quantiles
# 1% 5% 10% 25% 50% 75% 90% 95% 99%
# 349.69 539.3 675.7 1336.25 2636 4889 8313 14707.45 26413.87
# ----------------------------------------------------------------------------------------------------
# SID79 (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 28 8.36 9.43 0 57 2.28 9.88
# Quantiles
# 1% 5% 10% 25% 50% 75% 90% 95% 99%
# 0 0 1 2 5 10.25 21 26 38.19
# ----------------------------------------------------------------------------------------------------
# NWBIR79 (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 98 1352.81 1976 3 11631 3.18 14.45
# Quantiles
# 1% 5% 10% 25% 50% 75% 90% 95% 99%
# 3.99 11.9 44.7 250.5 874.5 1406.75 2987.9 5090.5 10624.17
# ----------------------------------------------------------------------------------------------------
Selecting Columns and Subsetting
We can select columns from the sf data frame without having to worry about taking along ‘geometry’:
# Selecting a sequence of columns
fselect(nc, AREA, NAME:FIPSNO)
# Simple feature collection with 100 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA NAME FIPS FIPSNO geometry
# 1 0.114 Ashe 37009 37009 MULTIPOLYGON (((-81.47276 3...
# 2 0.061 Alleghany 37005 37005 MULTIPOLYGON (((-81.23989 3...
# 3 0.143 Surry 37171 37171 MULTIPOLYGON (((-80.45634 3...
# Same using standard evaluation (gv is a shorthand for get_vars())
gv(nc, c("AREA", "NAME", "FIPS", "FIPSNO"))
# Simple feature collection with 100 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA NAME FIPS FIPSNO geometry
# 1 0.114 Ashe 37009 37009 MULTIPOLYGON (((-81.47276 3...
# 2 0.061 Alleghany 37005 37005 MULTIPOLYGON (((-81.23989 3...
# 3 0.143 Surry 37171 37171 MULTIPOLYGON (((-80.45634 3...
The same applies to subsetting rows (and columns):
# A fast and enhanced version of base::subset
fsubset(nc, AREA > fmean(AREA), AREA, NAME:FIPSNO)
# Simple feature collection with 44 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA NAME FIPS FIPSNO geometry
# 1 0.143 Surry 37171 37171 MULTIPOLYGON (((-80.45634 3...
# 2 0.153 Northampton 37131 37131 MULTIPOLYGON (((-77.21767 3...
# 3 0.153 Rockingham 37157 37157 MULTIPOLYGON (((-79.53051 3...
# A fast version of `[` (where i is used and optionally j)
ss(nc, 1:10, c("AREA", "NAME", "FIPS", "FIPSNO"))
# Simple feature collection with 10 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA NAME FIPS FIPSNO geometry
# 1 0.114 Ashe 37009 37009 MULTIPOLYGON (((-81.47276 3...
# 2 0.061 Alleghany 37005 37005 MULTIPOLYGON (((-81.23989 3...
# 3 0.143 Surry 37171 37171 MULTIPOLYGON (((-80.45634 3...
This is significantly faster than using [
,
base::subset()
, dplyr::select()
or
dplyr::filter()
:
library(microbenchmark)
library(dplyr)
# Selecting columns
microbenchmark(collapse = fselect(nc, AREA, NAME:FIPSNO),
dplyr = select(nc, AREA, NAME:FIPSNO),
collapse2 = gv(nc, c("AREA", "NAME", "FIPS", "FIPSNO")),
sf = nc[c("AREA", "NAME", "FIPS", "FIPSNO")])
# Unit: microseconds
# expr min lq mean median uq max neval
# collapse 3.034 3.9565 5.19429 5.1865 5.6990 22.878 100
# dplyr 431.279 452.2915 505.29015 466.3750 493.8450 3356.342 100
# collapse2 2.665 3.4850 4.59610 4.4075 5.0635 14.391 100
# sf 105.165 114.1235 120.39732 118.0390 124.9270 156.497 100
# Subsetting
microbenchmark(collapse = fsubset(nc, AREA > fmean(AREA), AREA, NAME:FIPSNO),
dplyr = select(nc, AREA, NAME:FIPSNO) |> filter(AREA > fmean(AREA)),
collapse2 = ss(nc, 1:10, c("AREA", "NAME", "FIPS", "FIPSNO")),
sf = nc[1:10, c("AREA", "NAME", "FIPS", "FIPSNO")])
# Unit: microseconds
# expr min lq mean median uq max neval
# collapse 9.676 11.5825 15.01707 14.4730 16.8920 30.463 100
# dplyr 890.643 917.6415 1055.40970 941.7085 1009.7890 5546.685 100
# collapse2 2.829 3.5465 5.40585 4.8995 6.4165 20.541 100
# sf 176.997 187.6160 202.72286 200.7565 210.8220 340.464 100
However, collapse functions don’t subset the ‘agr’ attribute on selecting columns, which (if specified) relates columns (attributes) to the geometry, and also don’t modify the ‘bbox’ attribute giving the overall boundaries of a set of geometries when subsetting the sf data frame. Keeping the full ‘agr’ attribute is not problematic for all practical purposes, but not changing ‘bbox’ upon subsetting may lead to too large margins when plotting the geometries of a subset sf data frame.
One way to to change this is calling st_make_valid()
on
the subset frame; but st_make_valid()
is very expensive,
thus unless the subset frame is very small, it is better to use
[
, base::subset()
or
dplyr::filter()
in cases where the bounding box size
matters.
Aggregation and Grouping
The flexibility and speed of collap()
for aggregation
can be used on sf data frames. A separate method for
sf objects was not considered necessary as one can simply
aggregate the geometry column using st_union()
:
# Aggregating by variable SID74 using the median for numeric and the mode for categorical columns
collap(nc, ~ SID74, custom = list(fmedian = is.numeric,
fmode = is.character,
st_union = "geometry")) # or use is.list to fetch the geometry
# Simple feature collection with 23 features and 15 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 SID74 NWBIR74 BIR79
# 1 0.0780 1.3070 1950.0 1950.0 Alleghany 37005 37073 37.0 487 0 0 40.0 594.0
# 2 0.0810 1.2880 1887.0 1887.0 Ashe 37009 37137 69.0 751 1 1 148.0 899.0
# 3 0.1225 1.6435 1959.5 1959.5 Caswell 37033 37078 39.5 1271 2 2 382.5 1676.5
# SID79 NWBIR79 geometry
# 1 1 45 MULTIPOLYGON (((-83.69563 3...
# 2 1 176 MULTIPOLYGON (((-80.02406 3...
# 3 2 452 MULTIPOLYGON (((-77.16129 3...
sf data frames can also be grouped and then aggregated using
fsummarise()
:
nc |> fgroup_by(SID74)
# Simple feature collection with 100 features and 14 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0
# 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3
# 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6
# NWBIR79 geometry
# 1 19 MULTIPOLYGON (((-81.47276 3...
# 2 12 MULTIPOLYGON (((-81.23989 3...
# 3 260 MULTIPOLYGON (((-80.45634 3...
#
# Grouped by: SID74 [23 | 4 (4) 1-13]
nc |>
fgroup_by(SID74) |>
fsummarise(AREA_Ag = fsum(AREA),
Perimeter_Ag = fmedian(PERIMETER),
geometry = st_union(geometry))
# Simple feature collection with 23 features and 3 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# SID74 AREA_Ag Perimeter_Ag geometry
# 1 0 1.103 1.3070 MULTIPOLYGON (((-83.69563 3...
# 2 1 0.914 1.2880 MULTIPOLYGON (((-80.02406 3...
# 3 2 1.047 1.6435 MULTIPOLYGON (((-77.16129 3...
Typically most of the time in aggregation is consumed by
st_union()
so that the speed of collapse does not
really become visible on most datasets. A faster alternative is to use
geos (sf backend for planar geometries) or s2
(sf backend for spherical geometries) directly:
# Using s2 backend: sensible for larger tasks
nc |>
fmutate(geometry = s2::as_s2_geography(geometry)) |>
fgroup_by(SID74) |>
fsummarise(AREA_Ag = fsum(AREA),
Perimeter_Ag = fmedian(PERIMETER),
geometry = s2::s2_union_agg(geometry)) |>
fmutate(geometry = st_as_sfc(geometry))
# Simple feature collection with 23 features and 3 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: WGS 84
# First 3 features:
# SID74 AREA_Ag Perimeter_Ag geometry
# 1 0 1.103 1.3070 MULTIPOLYGON (((-83.69563 3...
# 2 1 0.914 1.2880 MULTIPOLYGON (((-80.02406 3...
# 3 2 1.047 1.6435 MULTIPOLYGON (((-77.16129 3...
In general, also upon aggregation with collapse, functions
st_as_sfc()
, st_as_sf()
, or, in the worst
case, st_make_valid()
, may need to be invoked to ensure
valid sf object output. Functions collap()
and
fsummarise()
are attribute preserving but do not give
special regard to geometry columns.
One exception that both avoids the high cost of spatial functions in
aggregation and any need for ex-post conversion/validation is
aggregating spatial panel data over the time-dimension. Such panels can
quickly be aggregated using ffirst()
or
flast()
to aggregate the geometry:
# Creating a panel-dataset by simply duplicating nc for 2 different years
pnc <- rowbind(`2000` = nc, `2001` = nc, idcol = "Year") |> as_integer_factor()
pnc
# Simple feature collection with 200 features and 15 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# Year AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79
# 1 2000 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364
# 2 2000 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542
# 3 2000 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616
# SID79 NWBIR79 geometry
# 1 0 19 MULTIPOLYGON (((-81.47276 3...
# 2 3 12 MULTIPOLYGON (((-81.23989 3...
# 3 6 260 MULTIPOLYGON (((-80.45634 3...
# Aggregating by NAME, using the last value for all categorical data
collap(pnc, ~ NAME, fmedian, catFUN = flast, cols = -1L)
# Simple feature collection with 100 features and 15 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79
# 1 0.111 1.392 1904 1904 Alamance Alamance 37001 37001 1 4672 13 1243 5767
# 2 0.066 1.070 1950 1950 Alexander Alexander 37003 37003 2 1333 0 128 1683
# 3 0.061 1.231 1827 1827 Alleghany Alleghany 37005 37005 3 487 0 10 542
# SID79 NWBIR79 geometry
# 1 11 1397 MULTIPOLYGON (((-79.24619 3...
# 2 2 150 MULTIPOLYGON (((-81.10889 3...
# 3 3 12 MULTIPOLYGON (((-81.23989 3...
# Using fsummarise to aggregate just two variables and the geometry
pnc_ag <- pnc |>
fgroup_by(NAME) |>
fsummarise(AREA_Ag = fsum(AREA),
Perimeter_Ag = fmedian(PERIMETER),
geometry = flast(geometry))
# The geometry is still valid... (slt = shorthand for fselect)
plot(slt(pnc_ag, AREA_Ag))
Indexing
sf data frames can also become indexed frames (spatio-temporal panels):
pnc <- pnc |> findex_by(CNTY_ID, Year)
pnc
# Simple feature collection with 200 features and 15 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# Year AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79
# 1 2000 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364
# 2 2000 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542
# 3 2000 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616
# SID79 NWBIR79 geometry
# 1 0 19 MULTIPOLYGON (((-81.47276 3...
# 2 3 12 MULTIPOLYGON (((-81.23989 3...
# 3 6 260 MULTIPOLYGON (((-80.45634 3...
#
# Indexed by: CNTY_ID [100] | Year [2]
qsu(pnc$AREA)
# N/T Mean SD Min Max
# Overall 200 0.1263 0.0491 0.042 0.241
# Between 100 0.1263 0.0492 0.042 0.241
# Within 2 0.1263 0 0.1263 0.1263
settransform(pnc, AREA_diff = fdiff(AREA))
psmat(pnc$AREA_diff) |> head()
# 2000 2001
# 1825 NA 0
# 1827 NA 0
# 1828 NA 0
# 1831 NA 0
# 1832 NA 0
# 1833 NA 0
pnc <- unindex(pnc)
Unique Values, Ordering, Splitting, Binding
Functions funique()
and roworder[v]()
ignore the ‘geometry’ column in determining the unique values / order of
rows when applied to sf data frames. rsplit()
can
be used to (recursively) split an sf data frame into multiple
chunks.
# Splitting by SID74
rsplit(nc, ~ SID74) |> head(2)
# $`0`
# Simple feature collection with 13 features and 13 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 NWBIR74 BIR79 SID79 NWBIR79
# 1 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 10 542 3 12
# 2 0.062 1.547 1834 1834 Camden 37029 37029 15 286 115 350 2 139
# 3 0.091 1.284 1835 1835 Gates 37073 37073 37 420 254 594 2 371
# geometry
# 1 MULTIPOLYGON (((-81.23989 3...
# 2 MULTIPOLYGON (((-76.00897 3...
# 3 MULTIPOLYGON (((-76.56251 3...
#
# $`1`
# Simple feature collection with 11 features and 13 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 NWBIR74 BIR79 SID79 NWBIR79
# 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 10 1364 0 19
# 2 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 123 830 2 145
# 3 0.124 1.428 1837 1837 Stokes 37169 37169 85 1612 160 2038 5 176
# geometry
# 1 MULTIPOLYGON (((-81.47276 3...
# 2 MULTIPOLYGON (((-76.00897 3...
# 3 MULTIPOLYGON (((-80.02567 3...
The default in rsplit()
for data frames is
simplify = TRUE
, which, for a single LHS variable, would
just split the column-vector. This does not apply to sf data
frames as the ‘geometry’ column is always selected as well.
# Only splitting Area
rsplit(nc, AREA ~ SID74) |> head(1)
# $`0`
# Simple feature collection with 13 features and 1 field
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA geometry
# 1 0.061 MULTIPOLYGON (((-81.23989 3...
# 2 0.062 MULTIPOLYGON (((-76.00897 3...
# 3 0.091 MULTIPOLYGON (((-76.56251 3...
# For data frames the default simplify = TRUE drops the data frame structure
rsplit(qDF(nc), AREA ~ SID74) |> head(1)
# $`0`
# [1] 0.061 0.062 0.091 0.064 0.059 0.080 0.066 0.099 0.094 0.078 0.131 0.167 0.051
sf data frames can be combined using rowbind()
,
which, by default, preserves the attributes of the first object.
Transformations
For transforming and computing columns, fmutate()
and
ftransform[v]()
apply as to any other data frame.
fmutate(nc, gsum_AREA = fsum(AREA, SID74, TRA = "fill")) |> head()
# Simple feature collection with 6 features and 15 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0
# 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3
# 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6
# NWBIR79 geometry gsum_AREA
# 1 19 MULTIPOLYGON (((-81.47276 3... 0.914
# 2 12 MULTIPOLYGON (((-81.23989 3... 1.103
# 3 260 MULTIPOLYGON (((-80.45634 3... 1.380
# Same thing, more expensive
nc |> fgroup_by(SID74) |> fmutate(gsum_AREA = fsum(AREA)) |> fungroup() |> head()
# Simple feature collection with 6 features and 15 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0
# 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3
# 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6
# NWBIR79 geometry gsum_AREA
# 1 19 MULTIPOLYGON (((-81.47276 3... 0.914
# 2 12 MULTIPOLYGON (((-81.23989 3... 1.103
# 3 260 MULTIPOLYGON (((-80.45634 3... 1.380
Special attention to sf data frames is afforded by
fcompute()
, which can be used to compute new columns
dropping existing ones - except for the geometry column and any columns
selected through the keep
argument.
fcompute(nc, scaled_AREA = fscale(AREA),
gsum_AREA = fsum(AREA, SID74, TRA = "fill"),
keep = .c(AREA, SID74))
# Simple feature collection with 100 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA SID74 scaled_AREA gsum_AREA geometry
# 1 0.114 1 -0.2491860 0.914 MULTIPOLYGON (((-81.47276 3...
# 2 0.061 0 -1.3264176 1.103 MULTIPOLYGON (((-81.23989 3...
# 3 0.143 5 0.3402426 1.380 MULTIPOLYGON (((-80.45634 3...
Conversion to and from sf
The quick converters qDF()
, qDT()
, and
qTBL()
can be used to efficiently convert sf data
frames to standard data frames, data.table’s or
tibbles, and the result can be converted back to the original
sf data frame using setAttrib()
,
copyAttrib()
or copyMostAttrib()
.
library(data.table)
# Create a data.table on the fly to do an fast grouped rolling mean and back to sf
qDT(nc)[, list(roll_AREA = frollmean(AREA, 2), geometry), by = SID74] |> copyMostAttrib(nc)
# Simple feature collection with 100 features and 2 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# SID74 roll_AREA geometry
# 1 1 NA MULTIPOLYGON (((-81.47276 3...
# 2 1 0.092 MULTIPOLYGON (((-76.00897 3...
# 3 1 0.097 MULTIPOLYGON (((-80.02567 3...
The easiest way to strip a geometry column off an sf data
frame is via the function atomic_elem()
, which removes
list-like columns and, by default, also the class attribute. For
example, we can create a data.table without list column
using
qDT(atomic_elem(nc)) |> head()
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# <num> <num> <num> <num> <char> <char> <num> <int> <num> <num> <num> <num> <num>
# 1: 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0
# 2: 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3
# 3: 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6
# 4: 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1 123 830 2
# 5: 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9 1066 1606 3
# 6: 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7 954 1838 5
# NWBIR79
# <num>
# 1: 19
# 2: 12
# 3: 260
# 4: 145
# 5: 1197
# 6: 1237
This is also handy for other functions such as join()
and pivot()
, which are class agnostic like all of
collapse, but do not have any built-in logic to deal with the
sf column.
# Use atomic_elem() to strip geometry off y in left join
identical(nc, join(nc, atomic_elem(nc), overid = 2))
# left join: nc[AREA, PERIMETER, CNTY_, CNTY_ID, NAME, FIPS, FIPSNO, CRESS_ID, BIR74, SID74, NWBIR74, BIR79, SID79, NWBIR79] 100/100 (100%) <m:m> y[AREA, PERIMETER, CNTY_, CNTY_ID, NAME, FIPS, FIPSNO, CRESS_ID, BIR74, SID74, NWBIR74, BIR79, SID79, NWBIR79] 100/100 (100%)
# [1] TRUE
# In pivot: presently need to specify what to do with geometry column
pivot(nc, c("CNTY_ID", "geometry")) |> head()
# Simple feature collection with 6 features and 3 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# CNTY_ID geometry variable value
# 1 1825 MULTIPOLYGON (((-81.47276 3... AREA 0.114
# 2 1827 MULTIPOLYGON (((-81.23989 3... AREA 0.061
# 3 1828 MULTIPOLYGON (((-80.45634 3... AREA 0.143
# Or use
pivot(qDT(atomic_elem(nc)), "CNTY_ID") |> head()
# CNTY_ID variable value
# <num> <fctr> <char>
# 1: 1825 AREA 0.114
# 2: 1827 AREA 0.061
# 3: 1828 AREA 0.143
# 4: 1831 AREA 0.07
# 5: 1832 AREA 0.153
# 6: 1833 AREA 0.097
Support for units
Since v2.0.13, collapse explicitly supports/preserves units objects through dedicated methods that preserve the ‘units’ class wherever sensible.
nc_dist <- st_centroid(nc) |> st_distance()
nc_dist[1:3, 1:3]
# Units: [m]
# [,1] [,2] [,3]
# [1,] 0.00 34020.35 72728.02
# [2,] 34020.35 0.00 40259.55
# [3,] 72728.02 40259.55 0.00
fmean(nc_dist) |> head()
# Units: [m]
# [1] 250543.9 237040.0 217941.5 337016.5 250380.2 269604.6
fndistinct(nc_dist) |> head()
# [1] 100 100 100 100 100 100
Conclusion
collapse provides no deep integration with the sf ecosystem and cannot perform spatial operations, but offers sufficient features and flexibility to painlessly manipulate sf data frames at much greater speeds than dplyr.
This requires a bit of care by the user though to ensure that the returned sf objects are valid, especially following aggregation and subsetting.