# Fast (Grouped, Weighted) Statistical Functions for Matrix-Like Objects


With `fsum`, `fprod`, `fmean`, `fmedian`, `fmode`, `fvar`, `fsd`, `fmin`, `fmax`, `fnth`, `ffirst`, `flast`, `fnobs` and `fndistinct`, *collapse* presents a coherent set of extremely fast and flexible statistical functions (S3 generics) to perform column-wise, grouped and weighted computations on vectors, matrices and data frames, with special support for grouped data frames / tibbles (*dplyr*) and *data.table*s.

## Usage

```
## All functions (FUN) follow a common syntax in 4 methods:
FUN(x, ...)

## Default S3 method:
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, [nthreads = 1L,] ...)

## S3 method for class 'matrix'
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, drop = TRUE, [nthreads = 1L,] ...)

## S3 method for class 'data.frame'
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, drop = TRUE, [nthreads = 1L,] ...)

## S3 method for class 'grouped_df'
FUN(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = FALSE, keep.group_vars = TRUE,
    [keep.w = TRUE,] [stub = TRUE,] [nthreads = 1L,] ...)
```

## Arguments

| Argument | Description |
| --- | --- |
| `x` | a vector, matrix, data frame or grouped data frame (class 'grouped_df'). |
| `g` | a factor, `GRP` object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a `GRP` object) used to group `x`. |
| `w` | a numeric vector of (non-negative) weights, may contain missing values. Supported by `fsum`, `fprod`, `fmean`, `fmedian`, `fnth`, `fvar`, `fsd` and `fmode`. |
| `TRA` | an integer or quoted operator indicating the transformation to perform: 0 - "na", 1 - "fill", 2 - "replace", 3 - "-", 4 - "-+", 5 - "/", 6 - "%", 7 - "+", 8 - "*", 9 - "%%", 10 - "-%%". See `TRA`. |
| `na.rm` | logical. Skip missing values in `x`. Defaults to `TRUE` in all functions and implemented at very little computational cost. Not available for `fnobs`. |
| `use.g.names` | logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.tables. |
| `nthreads` | integer. The number of threads to utilize. Supported by `fsum`, `fmean`, `fmedian`, `fnth`, `fmode` and `fndistinct`. |
| `drop` | matrix and data.frame methods: Logical. `TRUE` drops dimensions and returns an atomic vector if `g = NULL` and `TRA = NULL`. |
| `keep.group_vars` | grouped_df method: Logical. `FALSE` removes grouping variables after computation. By default grouping variables are added, even if not present in the grouped_df. |
| `keep.w` | grouped_df method: Logical. `TRUE` (default) also aggregates weights and saves them in a column, `FALSE` removes weighting variable after computation (if contained in `grouped_df`). |
| `stub` | grouped_df method: Character. If `keep.w = TRUE` and `stub = TRUE` (default), the aggregated weights column is prefixed by the name of the aggregation function (mostly `"sum."`). Users can specify a different prefix through this argument, or set it to `FALSE` to avoid prefixing. |
| `...` | arguments to be passed to or from other methods. If `TRA` is used, passing `set = TRUE` will transform data by reference and return the result invisibly (except for the grouped_df method which always returns visible output). |
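
As a rough illustration of how `g`, `w` and `TRA` interact, the sketch below cross-checks three calls against base R equivalents. It assumes *collapse* is installed and uses the built-in `mtcars` data; all variable names are illustrative and not part of the package.

```r
library(collapse)

x <- mtcars$mpg   # values to summarise
g <- mtcars$cyl   # grouping vector (converted to a factor internally)
w <- mtcars$hp    # non-negative weights

# Grouped sum: one value per group, named by group since use.g.names = TRUE
grouped_sum <- fsum(x, g)
base_sum    <- tapply(x, g, sum)

# Weighted mean over the whole vector, i.e. sum(w * x) / sum(w)
weighted_mean <- fmean(x, w = w)
base_wmean    <- weighted.mean(x, w)

# TRA = "/" divides each value by the sum of its group,
# so the result has the same length as x
group_share <- fsum(x, g, TRA = "/")
base_share  <- x / ave(x, g, FUN = sum)

# Each comparison should report TRUE
all.equal(unname(grouped_sum), as.vector(base_sum))
all.equal(as.numeric(weighted_mean), base_wmean)
all.equal(as.numeric(group_share), base_share)
```

Without `TRA` the result is aggregated to one value per group; with `TRA` the statistic is computed per group and then applied back to every element of `x`.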

## Value

`x` suitably aggregated or transformed. Data frame column-attributes and overall attributes are generally preserved if the output is of the same data type.
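
The shape of the returned object depends on `g`, `TRA` and `drop`. The following is a minimal sketch of the cases described in the arguments above, assuming *collapse* is installed; the object names (`d`, `totals`, `by_cyl`, ...) are made up for illustration.

```r
library(collapse)

d <- mtcars[c("mpg", "hp", "wt")]          # small numeric data frame

totals    <- fsum(d)                       # g = NULL, TRA = NULL, drop = TRUE: atomic vector
totals_df <- fsum(d, drop = FALSE)         # same sums, kept as a one-row data frame
by_cyl    <- fsum(d, mtcars$cyl)           # aggregated: one row per group
scaled    <- fsum(d, mtcars$cyl, TRA = "/") # transformed: same dimensions as d

is.atomic(totals)
dim(totals_df)
dim(by_cyl)
dim(scaled)
```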

## Related Functionality

Panel-decomposed (i.e. between and within) statistics as well as grouped and weighted skewness and kurtosis are implemented in `qsu`.

The vector-valued functions and operators `fcumsum`, `fscale/STD`, `fbetween/B`, `fhdbetween/HDB`, `fwithin/W`, `fhdwithin/HDW`, `flag/L/F`, `fdiff/D/Dlog` and `fgrowth/G` are grouped under Data Transformations and Time Series and Panel Series. These functions also support indexed data (*plm*).
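
Some of these transformation functions overlap with what `TRA` offers. As a hedged sketch (assuming *collapse* is installed and relying on the default arguments of `fwithin` and `fbetween`), group-wise centering and group-mean filling can be compared with `fmean(..., TRA = "-")` and base R's `ave`:

```r
library(collapse)

x <- mtcars$mpg
g <- mtcars$cyl

centered_tra  <- fmean(x, g, TRA = "-")  # subtract group means via TRA
centered_fw   <- fwithin(x, g)           # dedicated within-transformation
centered_base <- x - ave(x, g)           # base R reference (ave uses mean by default)

group_means_fb   <- fbetween(x, g)       # each value replaced by its group mean
group_means_base <- ave(x, g)

# Each comparison should report TRUE
all.equal(as.numeric(centered_tra), centered_base)
all.equal(as.numeric(centered_fw), centered_base)
all.equal(as.numeric(group_means_fb), group_means_base)
```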

## Examples

```
library(collapse)

## default vector method
mpg <- mtcars$mpg
fsum(mpg) # Simple sum
fsum(mpg, TRA = "/") # Simple transformation: divide all values by the sum
fsum(mpg, mtcars$cyl) # Grouped sum
fmean(mpg, mtcars$cyl) # Grouped mean
fmean(mpg, w = mtcars$hp) # Weighted mean, weighted by hp
fmean(mpg, mtcars$cyl, mtcars$hp) # Grouped mean, weighted by hp
fsum(mpg, mtcars$cyl, TRA = "/") # Proportions / division by group sums
fmean(mpg, mtcars$cyl, mtcars$hp, # Subtract weighted group means, see also ?fwithin
      TRA = "-")
## data.frame method
fsum(mtcars)
fsum(mtcars, TRA = "%") # This computes percentages
fsum(mtcars, mtcars[c(2,8:9)]) # Grouped column sum
g <- GRP(mtcars, ~ cyl + vs + am) # Here precomputing the groups!
fsum(mtcars, g) # Faster !!
fmean(mtcars, g, mtcars$hp)
fmean(mtcars, g, mtcars$hp, "-") # Demeaning by weighted group means..
fmean(fgroup_by(mtcars, cyl, vs, am), hp, "-") # Another way of doing it..
fmode(wlddev, drop = FALSE) # Compute statistical modes of variables in this data
fmode(wlddev, wlddev$income) # Grouped statistical modes ..
## matrix method
m <- qM(mtcars)
fsum(m)
fsum(m, g) # ..
## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
mtcars %>% group_by(cyl,vs,am) %>% select(mpg,carb) %>% fsum()
mtcars %>% fgroup_by(cyl,vs,am) %>% fselect(mpg,carb) %>% fsum() # equivalent and faster !!
mtcars %>% fgroup_by(cyl,vs,am) %>% fsum(TRA = "%")
mtcars %>% fgroup_by(cyl,vs,am) %>% fmean(hp) # weighted grouped mean, save sum of weights
mtcars %>% fgroup_by(cyl,vs,am) %>% fmean(hp, keep.group_vars = FALSE)
```

## Benchmark

```
## This compares fsum with data.table (2 threads) and base::rowsum
library(collapse)
library(data.table)       # needed for the [, j, by] syntax below
library(microbenchmark)

# Starting with small data
mtcDT <- qDT(mtcars)
f <- qF(mtcars$cyl)
microbenchmark(mtcDT[, lapply(.SD, sum), by = f],
               rowsum(mtcDT, f, reorder = FALSE),
               fsum(mtcDT, f, na.rm = FALSE), unit = "relative")
#                               expr        min         lq      mean    median        uq       max neval cld
#  mtcDT[, lapply(.SD, sum), by = f] 145.436928 123.542134 88.681111 98.336378 71.880479 85.217726   100   c
#  rowsum(mtcDT, f, reorder = FALSE)   2.833333   2.798203  2.489064  2.937889  2.425724  2.181173   100   b
#      fsum(mtcDT, f, na.rm = FALSE)   1.000000   1.000000  1.000000  1.000000  1.000000  1.000000   100   a

# Now larger data
tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE)) # 100 columns with 100,000 obs
f <- qF(sample.int(1e4, 1e5, TRUE))                        # A factor with 10,000 groups
microbenchmark(tdata[, lapply(.SD, sum), by = f],
               rowsum(tdata, f, reorder = FALSE),
               fsum(tdata, f, na.rm = FALSE), unit = "relative")
#                               expr      min       lq     mean   median       uq       max neval cld
#  tdata[, lapply(.SD, sum), by = f] 2.646992 2.975489 2.834771 3.081313 3.120070 1.2766475   100   c
#  rowsum(tdata, f, reorder = FALSE) 1.747567 1.753313 1.629036 1.758043 1.839348 0.2720937   100   b
#      fsum(tdata, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100   a
```
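
Timings only matter if the results agree. A quick sanity check, sketched here under the assumption that *collapse* is installed (the names `fast` and `base` are illustrative), compares the matrix methods of `fsum` and `base::rowsum` on `mtcars`:

```r
library(collapse)

m <- qM(mtcars)        # numeric matrix version of mtcars
f <- qF(mtcars$cyl)    # grouping factor

fast <- fsum(m, f)     # grouped column sums via collapse
base <- rowsum(m, f)   # grouped column sums via base R (groups sorted by default)

dim(fast)                                # one row per group, one column per variable
all.equal(unname(fast), unname(base))    # should report TRUE
```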