vignettes/collapse_and_sf.Rmd
collapse_and_sf.Rmd
This short vignette focuses on using collapse with the popular sf package by Edzer Pebesma. It shows that collapse supports easy manipulation of sf data frames, at computation speeds far above dplyr.
collapse is class-agnostic and can theoretically be used to
manipulate any kind of list, matrix or data frame based object. Previous
versions of collapse could manipulate sf data frames,
but required users to manually include the geometry column when using
functions like in fselect
and fsubset
, and
exclude it in other functions like qsu
.
Version 1.6.0 of the package adds internal support for sf
data frames where the geometry column is handled by most essential
functions (fselect
, get_vars
,
num_vars
, fsubset
, ss
,
fgroup_by
, qsu
, descr
,
varying
, funique
, roworder
,
rsplit
, fcompute
, …). This largely does not
affect previously written code where the geometry column is explicitly
selected. To demonstrate these features, we can load a test dataset
provided by sf:
library(collapse)
library(sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
options(sf_max_print = 3)
nc
# Simple feature collection with 100 features and 14 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0
# 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3
# 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6
# NWBIR79 geometry
# 1 19 MULTIPOLYGON (((-81.47276 3...
# 2 12 MULTIPOLYGON (((-81.23989 3...
# 3 260 MULTIPOLYGON (((-80.45634 3...
The first thing that collapse 1.6.0 adds is summary statistics for sf data frames that automatically exclude the ‘geometry’ column:
# Which columns have at least 2 non-missing distinct values
varying(nc)
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
# TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# NWBIR74 BIR79 SID79 NWBIR79
# TRUE TRUE TRUE TRUE
# Quick summary stats
qsu(nc)
# N Mean SD Min Max
# AREA 100 0.1263 0.0492 0.042 0.241
# PERIMETER 100 1.673 0.4823 0.999 3.64
# CNTY_ 100 1985.96 106.5166 1825 2241
# CNTY_ID 100 1985.96 106.5166 1825 2241
# NAME 100 - - - -
# FIPS 100 - - - -
# FIPSNO 100 37100 58.023 37001 37199
# CRESS_ID 100 50.5 29.0115 1 100
# BIR74 100 3299.62 3848.1651 248 21588
# SID74 100 6.67 7.7812 0 44
# NWBIR74 100 1050.81 1432.9117 1 8027
# BIR79 100 4223.92 5179.4582 319 30757
# SID79 100 8.36 9.4319 0 57
# NWBIR79 100 1352.81 1975.9988 3 11631
# Detailed statistics description of each column
descr(nc)
# Dataset: nc, 14 Variables, N = 100
# ----------------------------------------------------------------------------------------------------
# AREA (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 77 0.13 0.05 0.04 0.24 0.48 2.5
# Quantiles
# 1% 5% 25% 50% 75% 95% 99%
# 0.04 0.06 0.09 0.12 0.15 0.21 0.24
# ----------------------------------------------------------------------------------------------------
# PERIMETER (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 96 1.67 0.48 1 3.64 1.48 5.95
# Quantiles
# 1% 5% 25% 50% 75% 95% 99%
# 1 1.09 1.32 1.61 1.86 2.72 3.2
# ----------------------------------------------------------------------------------------------------
# CNTY_ (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 100 1985.96 106.52 1825 2241 0.26 2.32
# Quantiles
# 1% 5% 25% 50% 75% 95% 99%
# 1826.98 1832.95 1902.25 1982 2067.25 2156.3 2238.03
# ----------------------------------------------------------------------------------------------------
# CNTY_ID (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 100 1985.96 106.52 1825 2241 0.26 2.32
# Quantiles
# 1% 5% 25% 50% 75% 95% 99%
# 1826.98 1832.95 1902.25 1982 2067.25 2156.3 2238.03
# ----------------------------------------------------------------------------------------------------
# NAME (character):
# Statistics
# N Ndist
# 100 100
# Table
# Freq Perc
# Ashe 1 1
# Alleghany 1 1
# Surry 1 1
# Currituck 1 1
# Northampton 1 1
# Hertford 1 1
# Camden 1 1
# Gates 1 1
# Warren 1 1
# Stokes 1 1
# Caswell 1 1
# Rockingham 1 1
# Granville 1 1
# Person 1 1
# ... 86 Others 86 86
#
# Summary of Table Frequencies
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 1 1 1 1 1
# ----------------------------------------------------------------------------------------------------
# FIPS (character):
# Statistics
# N Ndist
# 100 100
# Table
# Freq Perc
# 37009 1 1
# 37005 1 1
# 37171 1 1
# 37053 1 1
# 37131 1 1
# 37091 1 1
# 37029 1 1
# 37073 1 1
# 37185 1 1
# 37169 1 1
# 37033 1 1
# 37157 1 1
# 37077 1 1
# 37145 1 1
# ... 86 Others 86 86
#
# Summary of Table Frequencies
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 1 1 1 1 1
# ----------------------------------------------------------------------------------------------------
# FIPSNO (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 100 37100 58.02 37001 37199 -0 1.8
# Quantiles
# 1% 5% 25% 50% 75% 95% 99%
# 37002.98 37010.9 37050.5 37100 37149.5 37189.1 37197.02
# ----------------------------------------------------------------------------------------------------
# CRESS_ID (integer):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 100 50.5 29.01 1 100 0 1.8
# Quantiles
# 1% 5% 25% 50% 75% 95% 99%
# 1.99 5.95 25.75 50.5 75.25 95.05 99.01
# ----------------------------------------------------------------------------------------------------
# BIR74 (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 100 3299.62 3848.17 248 21588 2.79 11.79
# Quantiles
# 1% 5% 25% 50% 75% 95% 99%
# 283.64 419.75 1077 2180.5 3936 11193 20378.22
# ----------------------------------------------------------------------------------------------------
# SID74 (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 23 6.67 7.78 0 44 2.44 10.28
# Quantiles
# 1% 5% 25% 50% 75% 95% 99%
# 0 0 2 4 8.25 18.25 38.06
# ----------------------------------------------------------------------------------------------------
# NWBIR74 (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 93 1050.81 1432.91 1 8027 2.83 11.84
# Quantiles
# 1% 5% 25% 50% 75% 95% 99%
# 1 9.95 190 697.5 1168.5 3942.9 7052.84
# ----------------------------------------------------------------------------------------------------
# BIR79 (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 100 4223.92 5179.46 319 30757 2.99 13.1
# Quantiles
# 1% 5% 25% 50% 75% 95% 99%
# 349.69 539.3 1336.25 2636 4889 14707.45 26413.87
# ----------------------------------------------------------------------------------------------------
# SID79 (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 28 8.36 9.43 0 57 2.28 9.88
# Quantiles
# 1% 5% 25% 50% 75% 95% 99%
# 0 0 2 5 10.25 26 38.19
# ----------------------------------------------------------------------------------------------------
# NWBIR79 (numeric):
# Statistics
# N Ndist Mean SD Min Max Skew Kurt
# 100 98 1352.81 1976 3 11631 3.18 14.45
# Quantiles
# 1% 5% 25% 50% 75% 95% 99%
# 3.99 11.9 250.5 874.5 1406.75 5090.5 10624.17
# ----------------------------------------------------------------------------------------------------
We can now also easily select columns from the sf data frame without having to worry about taking along ‘geometry’:
# Selecting a sequence of columns
fselect(nc, AREA, NAME:FIPSNO)
# Simple feature collection with 100 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA NAME FIPS FIPSNO geometry
# 1 0.114 Ashe 37009 37009 MULTIPOLYGON (((-81.47276 3...
# 2 0.061 Alleghany 37005 37005 MULTIPOLYGON (((-81.23989 3...
# 3 0.143 Surry 37171 37171 MULTIPOLYGON (((-80.45634 3...
# Same using standard evaluation (gv is a shorthand for get_vars())
gv(nc, c("AREA", "NAME", "FIPS", "FIPSNO"))
# Simple feature collection with 100 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA NAME FIPS FIPSNO geometry
# 1 0.114 Ashe 37009 37009 MULTIPOLYGON (((-81.47276 3...
# 2 0.061 Alleghany 37005 37005 MULTIPOLYGON (((-81.23989 3...
# 3 0.143 Surry 37171 37171 MULTIPOLYGON (((-80.45634 3...
The same applies to subsetting rows (and columns):
# A fast and enhanced version of base::subset, now also supporting sf data
fsubset(nc, AREA > fmean(AREA), AREA, NAME:FIPSNO)
# Simple feature collection with 44 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA NAME FIPS FIPSNO geometry
# 1 0.143 Surry 37171 37171 MULTIPOLYGON (((-80.45634 3...
# 2 0.153 Northampton 37131 37131 MULTIPOLYGON (((-77.21767 3...
# 3 0.153 Rockingham 37157 37157 MULTIPOLYGON (((-79.53051 3...
# A fast version of `[` (where i is used and optionally j)
ss(nc, 1:10, c("AREA", "NAME", "FIPS", "FIPSNO"))
# Simple feature collection with 10 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA NAME FIPS FIPSNO geometry
# 1 0.114 Ashe 37009 37009 MULTIPOLYGON (((-81.47276 3...
# 2 0.061 Alleghany 37005 37005 MULTIPOLYGON (((-81.23989 3...
# 3 0.143 Surry 37171 37171 MULTIPOLYGON (((-80.45634 3...
This is significantly faster than using [
,
dplyr::select
or dplyr::filter
:
library(microbenchmark)
library(dplyr)
# Selecting columns
microbenchmark(collapse = fselect(nc, AREA, NAME:FIPSNO),
dplyr = select(nc, AREA, NAME:FIPSNO),
collapse2 = gv(nc, c("AREA", "NAME", "FIPS", "FIPSNO")),
sf = nc[c("AREA", "NAME", "FIPS", "FIPSNO")])
# Unit: microseconds
# expr min lq mean median uq max neval
# collapse 2.952 3.8745 5.29843 5.3505 6.2320 15.539 100
# dplyr 721.559 740.4190 851.35967 784.3300 835.5185 4868.258 100
# collapse2 2.132 2.8905 3.94707 3.9770 4.5510 7.585 100
# sf 98.359 107.4200 115.37564 114.4925 122.6515 139.892 100
# Subsetting
microbenchmark(collapse = fsubset(nc, AREA > fmean(AREA), AREA, NAME:FIPSNO),
dplyr = select(nc, AREA, NAME:FIPSNO) %>% filter(AREA > fmean(AREA)),
collapse2 = ss(nc, 1:10, c("AREA", "NAME", "FIPS", "FIPSNO")),
sf = nc[1:10, c("AREA", "NAME", "FIPS", "FIPSNO")])
# Unit: microseconds
# expr min lq mean median uq max neval
# collapse 8.856 10.9265 15.29628 14.186 19.1470 29.315 100
# dplyr 3081.437 3126.7420 3396.14029 3166.061 3219.2585 11497.876 100
# collapse2 2.337 2.9110 4.84292 3.731 6.3345 15.662 100
# sf 189.215 203.7495 220.10891 223.286 233.0645 338.373 100
It needs to be noted that at this point collapse functions don’t subset the ‘agr’ attribute on selecting columns, which (if specified) relates columns (attributes) to the geometry, and also don’t modify the ‘bbox’ attribute giving the overall boundaries of a set of geometries when subsetting the sf data frame. Keeping the full ‘agr’ attribute is not problematic for all practical purposes, but not changing ‘bbox’ upon subsetting may lead to too large margins when plotting the geometries of a subsetted sf data frame.
The flexibility and speed of collap
for aggregation can
be used on sf data frames. A separate method for sf
objects was not considered necessary as one can simply aggregate the
geometry column using st_union
:
# Aggregating by variable SID74 using the median for numeric and the mode for categorical columns
collap(nc, ~ SID74, custom = list(fmedian = is.numeric,
fmode = is.character,
st_union = "geometry")) # or use is.list to fetch the geometry
# Simple feature collection with 23 features and 15 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 SID74 NWBIR74 BIR79
# 1 0.0780 1.3070 1950.0 1950.0 Alleghany 37005 37073 37.0 487 0 0 40.0 594.0
# 2 0.0810 1.2880 1887.0 1887.0 Ashe 37009 37137 69.0 751 1 1 148.0 899.0
# 3 0.1225 1.6435 1959.5 1959.5 Caswell 37033 37078 39.5 1271 2 2 382.5 1676.5
# SID79 NWBIR79 geometry
# 1 1 45 MULTIPOLYGON (((-81.17667 3...
# 2 1 176 MULTIPOLYGON (((-75.94193 3...
# 3 2 452 MULTIPOLYGON (((-79.2585 36...
sf data frames can also be grouped and then aggregated using
fsummarise
:
nc %>% fgroup_by(SID74)
# Simple feature collection with 100 features and 14 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0
# 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3
# 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6
# NWBIR79 geometry
# 1 19 MULTIPOLYGON (((-81.47276 3...
# 2 12 MULTIPOLYGON (((-81.23989 3...
# 3 260 MULTIPOLYGON (((-80.45634 3...
#
# Grouped by: SID74 [23 | 4 (4) 1-13]
nc %>% fgroup_by(SID74) %>%
fsummarise(AREA_Ag = fsum(AREA),
Perimeter_Ag = fmedian(PERIMETER),
geometry = st_union(geometry))
# Simple feature collection with 23 features and 3 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# SID74 AREA_Ag Perimeter_Ag geometry
# 1 0 1.103 1.3070 MULTIPOLYGON (((-81.17667 3...
# 2 1 0.914 1.2880 MULTIPOLYGON (((-75.94193 3...
# 3 2 1.047 1.6435 MULTIPOLYGON (((-79.2585 36...
It needs to be noted here that typically most of the time in
aggregation is consumed by st_union
so that the speed of
collapse does not really become visible on most datasets. One
exception is spatial panel data when aggregating over the time-dimension
with identical geometries for each id over time. Such panels can quickly
be aggregated using ffirst
or flast
to
aggregate the geometry:
# Creating a panel-dataset by simply duplicating nc for 2 different years
pnc <- list(`2000` = nc, `2001` = nc) %>% unlist2d("Year") %>% copyMostAttrib(nc)
pnc
# Simple feature collection with 200 features and 15 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# Year AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79
# 1 2000 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364
# 2 2000 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542
# 3 2000 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616
# SID79 NWBIR79 geometry
# 1 0 19 MULTIPOLYGON (((-81.47276 3...
# 2 3 12 MULTIPOLYGON (((-81.23989 3...
# 3 6 260 MULTIPOLYGON (((-80.45634 3...
# Aggregating by NAME, using the last value for all categorical data
collap(pnc, ~ NAME, fmedian, catFUN = flast, cols = -1L)
# Simple feature collection with 100 features and 15 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79
# 1 0.111 1.392 1904 1904 Alamance Alamance 37001 37001 1 4672 13 1243 5767
# 2 0.066 1.070 1950 1950 Alexander Alexander 37003 37003 2 1333 0 128 1683
# 3 0.061 1.231 1827 1827 Alleghany Alleghany 37005 37005 3 487 0 10 542
# SID79 NWBIR79 geometry
# 1 11 1397 MULTIPOLYGON (((-79.24619 3...
# 2 2 150 MULTIPOLYGON (((-81.10889 3...
# 3 3 12 MULTIPOLYGON (((-81.23989 3...
# Using fsummarise to aggregate just two variables and the geometry
pnc_ag <- pnc %>% fgroup_by(NAME) %>%
fsummarise(AREA_Ag = fsum(AREA),
Perimeter_Ag = fmedian(PERIMETER),
geometry = flast(geometry))
# The geometry is still valid...
plot(slt(pnc_ag, AREA_Ag))
Functions funique
and roworder(v)
work on
sf data frames and ignore the geometry column for determining the unique
values / order of rows. rsplit
can be used to (recursively)
split an sf data frame into multiple chunks.
# Splitting by SID74
rsplit(nc, ~ SID74) %>% head(2)
# $`0`
# Simple feature collection with 13 features and 13 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 NWBIR74 BIR79 SID79 NWBIR79
# 1 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 10 542 3 12
# 2 0.062 1.547 1834 1834 Camden 37029 37029 15 286 115 350 2 139
# 3 0.091 1.284 1835 1835 Gates 37073 37073 37 420 254 594 2 371
# geometry
# 1 MULTIPOLYGON (((-81.23989 3...
# 2 MULTIPOLYGON (((-76.00897 3...
# 3 MULTIPOLYGON (((-76.56251 3...
#
# $`1`
# Simple feature collection with 11 features and 13 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 NWBIR74 BIR79 SID79 NWBIR79
# 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 10 1364 0 19
# 2 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 123 830 2 145
# 3 0.124 1.428 1837 1837 Stokes 37169 37169 85 1612 160 2038 5 176
# geometry
# 1 MULTIPOLYGON (((-81.47276 3...
# 2 MULTIPOLYGON (((-76.00897 3...
# 3 MULTIPOLYGON (((-80.02567 3...
It should be noted here that the default in rsplit
for
data frames is simplify = TRUE
, which for a single LHS
variable would just split the column-vector. This does not apply to
sf data frames as the ‘geometry’ column is always selected as
well.
# Only splitting Area
rsplit(nc, AREA ~ SID74) %>% head(1)
# $`0`
# Simple feature collection with 13 features and 1 field
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA geometry
# 1 0.061 MULTIPOLYGON (((-81.23989 3...
# 2 0.062 MULTIPOLYGON (((-76.00897 3...
# 3 0.091 MULTIPOLYGON (((-76.56251 3...
# For data frames the default simplify = TRUE drops the data frame structure
rsplit(qDF(nc), AREA ~ SID74) %>% head(1)
# $`0`
# [1] 0.061 0.062 0.091 0.064 0.059 0.080 0.066 0.099 0.094 0.078 0.131 0.167 0.051
For transforming and computing columns, ftransform(v)
and settransform(v)
apply as to any other data frame.
ftransform(nc, scaled_AREA = fscale(AREA),
gsum_AREA = fsum(AREA, SID74, TRA = "replace_fill"))
# Simple feature collection with 100 features and 16 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0
# 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3
# 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6
# NWBIR79 geometry scaled_AREA gsum_AREA
# 1 19 MULTIPOLYGON (((-81.47276 3... -0.2491860 0.914
# 2 12 MULTIPOLYGON (((-81.23989 3... -1.3264176 1.103
# 3 260 MULTIPOLYGON (((-80.45634 3... 0.3402426 1.380
# settransform materializes the change, same as nc <- ftransform(nc, ...)
settransform(nc, scaled_AREA = fscale(AREA),
gsum_AREA = fsum(AREA, SID74, TRA = "replace_fill"))
nc
# Simple feature collection with 100 features and 16 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
# 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0
# 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3
# 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6
# NWBIR79 geometry scaled_AREA gsum_AREA
# 1 19 MULTIPOLYGON (((-81.47276 3... -0.2491860 0.914
# 2 12 MULTIPOLYGON (((-81.23989 3... -1.3264176 1.103
# 3 260 MULTIPOLYGON (((-80.45634 3... 0.3402426 1.380
Special attention to sf data frames is afforded by
fcompute
, which can be used to compute new columns dropping
existing ones - except for the geometry column and any columns selected
through the keep
argument.
fcompute(nc, scaled_AREA = fscale(AREA),
gsum_AREA = fsum(AREA, SID74, TRA = "replace_fill"),
keep = .c(AREA, SID74))
# Simple feature collection with 100 features and 4 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# AREA SID74 scaled_AREA gsum_AREA geometry
# 1 0.114 1 -0.2491860 0.914 MULTIPOLYGON (((-81.47276 3...
# 2 0.061 0 -1.3264176 1.103 MULTIPOLYGON (((-81.23989 3...
# 3 0.143 5 0.3402426 1.380 MULTIPOLYGON (((-80.45634 3...
The quick converters qDF
, qDT
and
qTBL
can be used to very efficiently convert sf
data frames to standard data frames, data.table’s or
tibbles, and the result can be converted back to the original
sf data frame using setAttrib
,
copyAttrib
or copyMostAttrib
.
library(data.table)
qDT(nc)[, list(roll_AREA = frollmean(AREA, 2), geometry), by = SID74] %>% copyMostAttrib(nc)
# Simple feature collection with 100 features and 2 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
# First 3 features:
# SID74 roll_AREA geometry
# 1 1 NA MULTIPOLYGON (((-81.47276 3...
# 2 1 0.092 MULTIPOLYGON (((-76.00897 3...
# 3 1 0.097 MULTIPOLYGON (((-80.02567 3...