`fsum.Rd`

`fsum` is a generic function that computes the (column-wise) sum of all values in `x`, (optionally) grouped by `g` and/or weighted by `w` (e.g. to calculate survey totals). The `TRA` argument can further be used to transform `x` using its (grouped, weighted) sum.

```
fsum(x, ...)
# S3 method for default
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)
# S3 method for matrix
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)
# S3 method for data.frame
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)
# S3 method for grouped_df
fsum(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE,
keep.w = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)
```

- x
a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df').

- g
a factor, `GRP` object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a `GRP` object) used to group `x`.

- w
a numeric vector of (non-negative) weights, may contain missing values.

- TRA
an integer or quoted operator indicating the transformation to perform: 0 - "replace_NA" | 1 - "replace_fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See `TRA`.

- na.rm
logical. Skip missing values in `x`. Defaults to `TRUE` and implemented at very little computational cost. If `na.rm = FALSE` a `NA` is returned when encountered.

- use.g.names
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for *data.table*'s.

- fill
logical. Initialize result with `0` instead of `NA` when `na.rm = TRUE` e.g. `fsum(NA, fill = TRUE)` returns `0` instead of `NA`.

- nthreads
integer. The number of threads to utilize. See Details.

- drop
*matrix and data.frame method:* Logical. `TRUE` drops dimensions and returns an atomic vector if `g = NULL` and `TRA = NULL`.

- keep.group_vars
*grouped_df method:* Logical. `FALSE` removes grouping variables after computation.

- keep.w
*grouped_df method:* Logical. Retain summed weighting variable after computation (if contained in `grouped_df`).

- ...
arguments to be passed to or from other methods. If `TRA` is used, passing `set = TRUE` will transform data by reference and return the result invisibly.
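
As a rough illustration of what some of the `TRA` operators compute, the following base R sketch reproduces three of them by hand, first for an ungrouped sum and then for a grouped one. It is not how `fsum` is implemented; the names `stat`, `gstat` and the result variables are hypothetical, and `ave()` is used only to expand group sums to the length of `x`. The data contain no missing values, so the distinction between "replace" and "replace_fill" does not arise here.

```r
x <- mtcars$mpg
g <- mtcars$cyl

# Ungrouped: the statistic is a single number
stat <- sum(x)
replaced <- rep(stat, length(x))    # TRA = "replace": every value becomes the sum
centered <- x - stat                # TRA = "-": subtract the sum
percent  <- x / stat * 100          # TRA = "%": percentage of the sum

# Grouped: the statistic is expanded to one value per observation
gstat <- ave(x, g, FUN = sum)
percent_by_group <- x / gstat * 100 # TRA = "%" with g supplied
```

By construction the percentages add up to 100 overall, and to 100 within each group in the grouped case.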

Missing-value removal as controlled by the `na.rm` argument is done very efficiently by simply skipping them in the computation (thus setting `na.rm = FALSE` on data with no missing values doesn't give extra speed). Large performance gains can nevertheless be achieved in the presence of missing values if `na.rm = FALSE`, since then the corresponding computation is terminated once a `NA` is encountered and `NA` is returned (unlike `sum` which just runs through without any checks).
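
The two `na.rm` modes described above can be sketched in plain R as follows. This is only an illustration of the control flow (skip versus early exit), written as a slow R loop; `sum_sketch` is a hypothetical helper and not part of the package, whose actual code is compiled.

```r
# Hypothetical helper mimicking the two na.rm modes described above
sum_sketch <- function(x, na.rm = TRUE) {
  total <- 0
  seen <- FALSE                 # becomes TRUE once a non-missing value is added
  for (v in x) {
    if (is.na(v)) {
      if (na.rm) next           # na.rm = TRUE: skip the missing value
      return(NA_real_)          # na.rm = FALSE: stop at the first NA
    }
    total <- total + v
    seen <- TRUE
  }
  # Input with no non-missing values gives NA, in line with fsum(NA) when fill = FALSE
  if (seen) total else NA_real_
}

sum_sketch(c(1, NA, 3))                 # 4
sum_sketch(c(1, NA, 3), na.rm = FALSE)  # NA
```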

The weighted sum (e.g. survey total) is computed as `sum(x * w)`, but in one pass and about twice as efficient. If `na.rm = TRUE`, missing values will be removed from both `x` and `w` i.e. utilizing only `x[complete.cases(x,w)]` and `w[complete.cases(x,w)]`.
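
For a small worked case, the complete-cases rule above can be written out in base R. The vectors are made up for illustration, and the cross-check against `fsum` runs only if the collapse package happens to be installed.

```r
x <- c(1, 2, NA, 4)
w <- c(2, NA, 1, 0.5)

cc <- complete.cases(x, w)     # TRUE where both x and w are non-missing
wsum <- sum(x[cc] * w[cc])     # 1 * 2 + 4 * 0.5 = 4

# Optional cross-check, run only if the collapse package is installed
if (requireNamespace("collapse", quietly = TRUE)) {
  print(all.equal(collapse::fsum(x, w = w, na.rm = TRUE), wsum))
}
```

Only the first and last observations contribute, because each of the other two has a missing value in either `x` or `w`.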

This all seamlessly generalizes to grouped computations, which are performed in a single pass (without splitting the data) and are therefore extremely fast. See Benchmark and Examples below.
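
The idea of a single-pass grouped sum can be sketched in base R: keep one accumulator per group and add each value to its group's total, instead of splitting the data and summing each piece. `grouped_sum_sketch` is a hypothetical helper for illustration only; it ignores weights and missing values and, being an interpreted loop, says nothing about the speed of the compiled code.

```r
# Hypothetical single-pass grouped sum: one accumulator per group, no splitting
grouped_sum_sketch <- function(x, g) {
  f <- as.factor(g)
  idx <- as.integer(f)                   # integer group codes 1..nlevels(f)
  out <- numeric(nlevels(f))             # accumulators, initialised at 0
  for (i in seq_along(x)) {
    out[idx[i]] <- out[idx[i]] + x[i]    # add each value to its group's total
  }
  names(out) <- levels(f)
  out
}

res <- grouped_sum_sketch(mtcars$mpg, mtcars$cyl)
ref <- tapply(mtcars$mpg, mtcars$cyl, sum)   # split-apply reference from base R
all.equal(unname(res), as.numeric(ref))
```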

When applied to data frames with groups or `drop = FALSE`, `fsum` preserves all column attributes. The attributes of the data frame itself are also preserved.

Since v1.6.0, `fsum` explicitly supports integers. Integers are summed using the long long type in C, which is bounded at +-9,223,372,036,854,775,807 (so ~4.3 billion times greater than the minimum/maximum R integer, bounded at +-2,147,483,647). If the value of the sum is outside +-2,147,483,647, a double containing the result is returned; otherwise an integer is returned. With groups, an integer overflow error is raised if the sum in any group is outside +-2,147,483,647. Data should be coerced to double beforehand in such cases.
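
The bounds quoted above, and the recommended coercion to double, can be checked with a few lines of base R. The vector `x` and grouping `g` are made up for illustration, and the grouped call runs only if the collapse package is installed.

```r
int_max <- .Machine$integer.max     # 2147483647, the largest R integer
ratio <- 2^63 / int_max             # roughly 4.29e9, the "~4.3 billion" above

x <- c(int_max, 1L)                 # integer vector whose sum exceeds int_max
safe_total <- sum(as.double(x))     # coerce to double first: 2147483648

# With groups, coerce to double beforehand as recommended above.
# Optional check, run only if the collapse package is installed:
if (requireNamespace("collapse", quietly = TRUE)) {
  g <- c(1L, 1L)                    # a single group containing both values
  print(collapse::fsum(as.double(x), g))
}
```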

Multithreading, added in v1.8.0, applies at the column-level unless `g = NULL` and `nthreads > NCOL(x)`. Parallelism over groups is not available because sums are computed simultaneously within each group. `nthreads = 1L` uses a serial version of the code, not parallel code running on one thread. This serial code is always used with fewer than 100,000 obs (`length(x) < 100000` for vectors and matrices), because parallel execution itself has some overhead.
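
A minimal way to try the `nthreads` argument is sketched below, assuming the collapse package is installed and at least two threads are available. The matrix is made up for illustration; it has 200,000 elements so that it lies above the size threshold mentioned above, and two columns so that `nthreads = 2L` corresponds to the column-level case.

```r
set.seed(1)
m <- matrix(rnorm(2e5), ncol = 2)   # 200,000 values in 2 columns

# Run only if the collapse package is installed
if (requireNamespace("collapse", quietly = TRUE)) {
  serial   <- collapse::fsum(m, nthreads = 1L)   # serial code
  threaded <- collapse::fsum(m, nthreads = 2L)   # multithreaded over columns
  print(all.equal(serial, threaded))
}
```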

The (`w` weighted) sum of `x`, grouped by `g`, or (if `TRA` is used) `x` transformed by its (grouped, weighted) sum.

```
library(collapse)
## default vector method
mpg <- mtcars$mpg
fsum(mpg) # Simple sum
fsum(mpg, w = mtcars$hp) # Weighted sum (total): Weighted by hp
fsum(mpg, TRA = "%") # Simple transformation: obtain percentages of mpg
fsum(mpg, mtcars$cyl) # Grouped sum
fsum(mpg, mtcars$cyl, mtcars$hp) # Weighted grouped sum (total)
fsum(mpg, mtcars[c(2,8:9)]) # More groups..
g <- GRP(mtcars, ~ cyl + vs + am) # Precomputing groups gives more speed !
fsum(mpg, g)
fmean(mpg, g) == fsum(mpg, g) / fnobs(mpg, g)
fsum(mpg, g, TRA = "%") # Percentages by group
## data.frame method
fsum(mtcars)
fsum(mtcars, TRA = "%")
fsum(mtcars, g)
fsum(mtcars, g, TRA = "%")
## matrix method
m <- qM(mtcars)
fsum(m)
fsum(m, TRA = "%")
fsum(m, g)
fsum(m, g, TRA = "%")
## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
mtcars %>% group_by(cyl,vs,am) %>% fsum(hp) # Weighted grouped sum (total)
mtcars %>% fgroup_by(cyl,vs,am) %>% fsum(hp) # Equivalent and faster !!
mtcars %>% fgroup_by(cyl,vs,am) %>% fsum(TRA = "%")
mtcars %>% fgroup_by(cyl,vs,am) %>% fselect(mpg) %>% fsum()
```

```
## This compares fsum with data.table (2 threads) and base::rowsum
# Starting with small data
library(data.table) # needed for the mtcDT[...] syntax below
mtcDT <- qDT(mtcars)
f <- qF(mtcars$cyl)
library(microbenchmark)
microbenchmark(mtcDT[, lapply(.SD, sum), by = f],
               rowsum(mtcDT, f, reorder = FALSE),
               fsum(mtcDT, f, na.rm = FALSE), unit = "relative")
#                               expr        min         lq      mean    median        uq       max neval cld
#  mtcDT[, lapply(.SD, sum), by = f] 145.436928 123.542134 88.681111 98.336378 71.880479 85.217726   100   c
#  rowsum(mtcDT, f, reorder = FALSE)   2.833333   2.798203  2.489064  2.937889  2.425724  2.181173   100  b
#      fsum(mtcDT, f, na.rm = FALSE)   1.000000   1.000000  1.000000  1.000000  1.000000  1.000000   100 a
# Now larger data
tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE)) # 100 columns with 100,000 obs
f <- qF(sample.int(1e4, 1e5, TRUE)) # A factor with 10,000 groups
microbenchmark(tdata[, lapply(.SD, sum), by = f],
               rowsum(tdata, f, reorder = FALSE),
               fsum(tdata, f, na.rm = FALSE), unit = "relative")
#                               expr      min       lq     mean   median       uq       max neval cld
#  tdata[, lapply(.SD, sum), by = f] 2.646992 2.975489 2.834771 3.081313 3.120070 1.2766475   100   c
#  rowsum(tdata, f, reorder = FALSE) 1.747567 1.753313 1.629036 1.758043 1.839348 0.2720937   100  b
#      fsum(tdata, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100 a
```