BY is an S3 generic that efficiently applies functions over vectors, or over the columns of matrices and data frames, by groups. Like dapply, it seeks to retain the structure and attributes of the data, but it can also output to various standard formats. Simple parallel execution is also available.

BY(x, ...)

# S3 method for default
BY(x, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "vector", "list"))

# S3 method for matrix
BY(x, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))

# S3 method for data.frame
BY(x, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))

# S3 method for grouped_df
BY(x, FUN, ..., use.g.names = FALSE, keep.group_vars = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))

Arguments

x

an atomic vector, matrix, data frame or similar object.

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

FUN

a function, can be scalar- or vector-valued.

...

further arguments to FUN.

use.g.names

logical. Generate group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.

sort

logical. Sort the groups? Internally passed to GRP or qF, and only effective if g is not already a factor or GRP object.

expand.wide

logical. If FUN is a vector-valued function returning a vector of fixed length > 1 (such as the quantile function), expand.wide can be used to return the result in a wider format (instead of stacking the resulting vectors of fixed length above each other in each output column).

parallel

logical. TRUE implements simple parallel execution by internally calling mclapply instead of lapply.

mc.cores

integer. Argument to mclapply indicating the number of cores to use for parallel execution. Can use detectCores() to select all available cores.

return

an integer or string indicating the type of object to return. The default, 1 - "same", returns the same object type (i.e. class and other attributes are retained; only the dimension names are adjusted). 2 - "matrix" always returns the output as a matrix, 3 - "data.frame" always returns a data frame, and 4 - "list" returns the raw (uncombined) output. Note: 4 - "list" works together with expand.wide to return a list of matrices.

keep.group_vars

grouped_df method: Logical. FALSE removes grouping variables after computation.
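The effect of expand.wide and parallel can be approximated in base R. The following is only a sketch of the idea, not the code BY runs internally: the object names (pieces, stacked, wide, n_cores) are illustrative, and the fallback to one core on Windows is an assumption made here because mclapply relies on forking.

```r
library(parallel)

v <- iris$Sepal.Length
f <- iris$Species
pieces <- split(v, f)                      # one numeric vector per species

# FUN returns a vector of fixed length 5 for every group
per_group <- lapply(pieces, quantile)

# Stacked layout (analogous to expand.wide = FALSE): one long named vector
stacked <- unlist(per_group)

# Wide layout (analogous to expand.wide = TRUE): one row per group
wide <- do.call(rbind, per_group)

# Simple parallelism: swap lapply for mclapply (illustrative core count)
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
serial <- lapply(pieces, sum)
forked <- mclapply(pieces, sum, mc.cores = n_cores)

print(wide)
print(identical(serial, forked))
```

For cheap functions such as sum, the overhead of forking usually outweighs any gain, so parallel execution is mainly of interest when FUN is expensive.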

Details

BY is a frugal re-implementation of the Split-Apply-Combine computing paradigm. It is generally faster than tapply, by, aggregate and plyr, and preserves data attributes just like dapply.

It is, however, principally a wrapper around lapply(split(x, g), FUN, ...) that strongly optimizes attribute checking compared to base R functions. For more details, see the documentation for dapply, which works very similarly (apart from the splitting performed in BY). For larger tasks requiring split-apply-combine computing on data frames, use dplyr or data.table, or try to work with the Fast Statistical Functions.
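As a rough illustration of the lapply(split(x, g), FUN, ...) pattern mentioned above, the following base R sketch reproduces a grouped sum for a vector and, column by column, for a data frame. It is not the BY implementation and omits the attribute handling; the names sums, num and res are illustrative.

```r
v <- iris$Sepal.Length
f <- iris$Species

# Vector case: split by group, apply FUN to each piece, combine
sums <- unlist(lapply(split(v, f), sum))
print(sums)

# Data frame case: repeat the same three steps for every numeric column
num <- iris[sapply(iris, is.numeric)]
res <- as.data.frame(
  lapply(num, function(col) unlist(lapply(split(col, f), sum)))
)
print(res)
```

The grouped sums agree with the BY(v, f, sum) and BY(num_vars(iris), f, sum) results shown in the examples.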

BY is used internally in collap for functions that are not Fast Statistical Functions.

Value

x, where FUN was applied to every column split by g.

See also

Examples

v <- iris$Sepal.Length   # A numeric vector
f <- iris$Species        # A factor. Vectors/lists will internally be converted to factor

## default vector method
BY(v, f, sum)                                # Sum by species
#>     setosa versicolor  virginica
#>      250.3      296.8      329.4
head(BY(v, f, scale))                        # Scale by species (please use fscale instead)
#>     setosa1     setosa2     setosa3     setosa4     setosa5     setosa6
#>  0.26667447 -0.30071802 -0.86811050 -1.15180675 -0.01702177  1.11776320
head(BY(v, f, scale, use.g.names = FALSE))   # Omitting auto-generated names
#> [1]  0.26667447 -0.30071802 -0.86811050 -1.15180675 -0.01702177  1.11776320
BY(v, f, quantile)                           # Species quantiles: by default stacked
#>       setosa.0%      setosa.25%      setosa.50%      setosa.75%     setosa.100%
#>           4.300           4.800           5.000           5.200           5.800
#>   versicolor.0%  versicolor.25%  versicolor.50%  versicolor.75% versicolor.100%
#>           4.900           5.600           5.900           6.300           7.000
#>    virginica.0%   virginica.25%   virginica.50%   virginica.75%  virginica.100%
#>           4.900           6.225           6.500           6.900           7.900
BY(v, f, quantile, expand.wide = TRUE)       # Wide format
#>             0%   25% 50% 75% 100%
#> setosa     4.3 4.800 5.0 5.2  5.8
#> versicolor 4.9 5.600 5.9 6.3  7.0
#> virginica  4.9 6.225 6.5 6.9  7.9

## matrix method
m <- qM(num_vars(iris))
BY(m, f, sum)                                # Also return as matrix
#>            Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa            250.3       171.4         73.1        12.3
#> versicolor        296.8       138.5        213.0        66.3
#> virginica         329.4       148.7        277.6       101.3
BY(m, f, sum, return = "data.frame")         # Return as data.frame.. also works for computations below
#>            Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa            250.3       171.4         73.1        12.3
#> versicolor        296.8       138.5        213.0        66.3
#> virginica         329.4       148.7        277.6       101.3
head(BY(m, f, scale))
#>         Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa1   0.26667447   0.1899414   -0.3570112  -0.4364923
#> setosa2  -0.30071802  -1.1290958   -0.3570112  -0.4364923
#> setosa3  -0.86811050  -0.6014810   -0.9328358  -0.4364923
#> setosa4  -1.15180675  -0.8652884    0.2188133  -0.4364923
#> setosa5  -0.01702177   0.4537488   -0.3570112  -0.4364923
#> setosa6   1.11776320   1.2451711    1.3704625   1.4613004
head(BY(m, f, scale, use.g.names = FALSE))
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]   0.26667447   0.1899414   -0.3570112  -0.4364923
#> [2,]  -0.30071802  -1.1290958   -0.3570112  -0.4364923
#> [3,]  -0.86811050  -0.6014810   -0.9328358  -0.4364923
#> [4,]  -1.15180675  -0.8652884    0.2188133  -0.4364923
#> [5,]  -0.01702177   0.4537488   -0.3570112  -0.4364923
#> [6,]   1.11776320   1.2451711    1.3704625   1.4613004
BY(m, f, quantile)
#>                 Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa.0%              4.300       2.300        1.000         0.1
#> setosa.25%             4.800       3.200        1.400         0.2
#> setosa.50%             5.000       3.400        1.500         0.2
#> setosa.75%             5.200       3.675        1.575         0.3
#> setosa.100%            5.800       4.400        1.900         0.6
#> versicolor.0%          4.900       2.000        3.000         1.0
#> versicolor.25%         5.600       2.525        4.000         1.2
#> versicolor.50%         5.900       2.800        4.350         1.3
#> versicolor.75%         6.300       3.000        4.600         1.5
#> versicolor.100%        7.000       3.400        5.100         1.8
#> virginica.0%           4.900       2.200        4.500         1.4
#> virginica.25%          6.225       2.800        5.100         1.8
#> virginica.50%          6.500       3.000        5.550         2.0
#> virginica.75%          6.900       3.175        5.875         2.3
#> virginica.100%         7.900       3.800        6.900         2.5
BY(m, f, quantile, expand.wide = TRUE)
#>            Sepal.Length.0% Sepal.Length.25% Sepal.Length.50% Sepal.Length.75%
#> setosa                 4.3            4.800              5.0              5.2
#> versicolor             4.9            5.600              5.9              6.3
#> virginica              4.9            6.225              6.5              6.9
#>            Sepal.Length.100% Sepal.Width.0% Sepal.Width.25% Sepal.Width.50%
#> setosa                   5.8            2.3           3.200             3.4
#> versicolor               7.0            2.0           2.525             2.8
#> virginica                7.9            2.2           2.800             3.0
#>            Sepal.Width.75% Sepal.Width.100% Petal.Length.0% Petal.Length.25%
#> setosa               3.675              4.4             1.0              1.4
#> versicolor           3.000              3.4             3.0              4.0
#> virginica            3.175              3.8             4.5              5.1
#>            Petal.Length.50% Petal.Length.75% Petal.Length.100% Petal.Width.0%
#> setosa                 1.50            1.575               1.9            0.1
#> versicolor             4.35            4.600               5.1            1.0
#> virginica              5.55            5.875               6.9            1.4
#>            Petal.Width.25% Petal.Width.50% Petal.Width.75% Petal.Width.100%
#> setosa                 0.2             0.2             0.3              0.6
#> versicolor             1.2             1.3             1.5              1.8
#> virginica              1.8             2.0             2.3              2.5
BY(m, f, quantile, expand.wide = TRUE,       # Return as list of matrices
   return = "list")
#> $Sepal.Length
#>             0%   25% 50% 75% 100%
#> setosa     4.3 4.800 5.0 5.2  5.8
#> versicolor 4.9 5.600 5.9 6.3  7.0
#> virginica  4.9 6.225 6.5 6.9  7.9
#>
#> $Sepal.Width
#>             0%   25% 50%   75% 100%
#> setosa     2.3 3.200 3.4 3.675  4.4
#> versicolor 2.0 2.525 2.8 3.000  3.4
#> virginica  2.2 2.800 3.0 3.175  3.8
#>
#> $Petal.Length
#>             0% 25%  50%   75% 100%
#> setosa     1.0 1.4 1.50 1.575  1.9
#> versicolor 3.0 4.0 4.35 4.600  5.1
#> virginica  4.5 5.1 5.55 5.875  6.9
#>
#> $Petal.Width
#>             0% 25% 50% 75% 100%
#> setosa     0.1 0.2 0.2 0.3  0.6
#> versicolor 1.0 1.2 1.3 1.5  1.8
#> virginica  1.4 1.8 2.0 2.3  2.5
#>

## data.frame method
BY(num_vars(iris), f, sum)                   # Also returns a data.frame
#>            Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa            250.3       171.4         73.1        12.3
#> versicolor        296.8       138.5        213.0        66.3
#> virginica         329.4       148.7        277.6       101.3
BY(num_vars(iris), f, sum, return = 2)       # Return as matrix.. also works for computations below
#>            Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa            250.3       171.4         73.1        12.3
#> versicolor        296.8       138.5        213.0        66.3
#> virginica         329.4       148.7        277.6       101.3
head(BY(num_vars(iris), f, scale))
#>         Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa1   0.26667447   0.1899414   -0.3570112  -0.4364923
#> setosa2  -0.30071802  -1.1290958   -0.3570112  -0.4364923
#> setosa3  -0.86811050  -0.6014810   -0.9328358  -0.4364923
#> setosa4  -1.15180675  -0.8652884    0.2188133  -0.4364923
#> setosa5  -0.01702177   0.4537488   -0.3570112  -0.4364923
#> setosa6   1.11776320   1.2451711    1.3704625   1.4613004
head(BY(num_vars(iris), f, scale, use.g.names = FALSE))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1   0.26667447   0.1899414   -0.3570112  -0.4364923
#> 2  -0.30071802  -1.1290958   -0.3570112  -0.4364923
#> 3  -0.86811050  -0.6014810   -0.9328358  -0.4364923
#> 4  -1.15180675  -0.8652884    0.2188133  -0.4364923
#> 5  -0.01702177   0.4537488   -0.3570112  -0.4364923
#> 6   1.11776320   1.2451711    1.3704625   1.4613004
BY(num_vars(iris), f, quantile)
#>                 Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa.0%              4.300       2.300        1.000         0.1
#> setosa.25%             4.800       3.200        1.400         0.2
#> setosa.50%             5.000       3.400        1.500         0.2
#> setosa.75%             5.200       3.675        1.575         0.3
#> setosa.100%            5.800       4.400        1.900         0.6
#> versicolor.0%          4.900       2.000        3.000         1.0
#> versicolor.25%         5.600       2.525        4.000         1.2
#> versicolor.50%         5.900       2.800        4.350         1.3
#> versicolor.75%         6.300       3.000        4.600         1.5
#> versicolor.100%        7.000       3.400        5.100         1.8
#> virginica.0%           4.900       2.200        4.500         1.4
#> virginica.25%          6.225       2.800        5.100         1.8
#> virginica.50%          6.500       3.000        5.550         2.0
#> virginica.75%          6.900       3.175        5.875         2.3
#> virginica.100%         7.900       3.800        6.900         2.5
BY(num_vars(iris), f, quantile, expand.wide = TRUE)
#>            Sepal.Length.0% Sepal.Length.25% Sepal.Length.50% Sepal.Length.75%
#> setosa                 4.3            4.800              5.0              5.2
#> versicolor             4.9            5.600              5.9              6.3
#> virginica              4.9            6.225              6.5              6.9
#>            Sepal.Length.100% Sepal.Width.0% Sepal.Width.25% Sepal.Width.50%
#> setosa                   5.8            2.3           3.200             3.4
#> versicolor               7.0            2.0           2.525             2.8
#> virginica                7.9            2.2           2.800             3.0
#>            Sepal.Width.75% Sepal.Width.100% Petal.Length.0% Petal.Length.25%
#> setosa               3.675              4.4             1.0              1.4
#> versicolor           3.000              3.4             3.0              4.0
#> virginica            3.175              3.8             4.5              5.1
#>            Petal.Length.50% Petal.Length.75% Petal.Length.100% Petal.Width.0%
#> setosa                 1.50            1.575               1.9            0.1
#> versicolor             4.35            4.600               5.1            1.0
#> virginica              5.55            5.875               6.9            1.4
#>            Petal.Width.25% Petal.Width.50% Petal.Width.75% Petal.Width.100%
#> setosa                 0.2             0.2             0.3              0.6
#> versicolor             1.2             1.3             1.5              1.8
#> virginica              1.8             2.0             2.3              2.5
BY(num_vars(iris), f, quantile,              # Return as list of matrices
   expand.wide = TRUE, return = "list")
#> $Sepal.Length
#>             0%   25% 50% 75% 100%
#> setosa     4.3 4.800 5.0 5.2  5.8
#> versicolor 4.9 5.600 5.9 6.3  7.0
#> virginica  4.9 6.225 6.5 6.9  7.9
#>
#> $Sepal.Width
#>             0%   25% 50%   75% 100%
#> setosa     2.3 3.200 3.4 3.675  4.4
#> versicolor 2.0 2.525 2.8 3.000  3.4
#> virginica  2.2 2.800 3.0 3.175  3.8
#>
#> $Petal.Length
#>             0% 25%  50%   75% 100%
#> setosa     1.0 1.4 1.50 1.575  1.9
#> versicolor 3.0 4.0 4.35 4.600  5.1
#> virginica  4.5 5.1 5.55 5.875  6.9
#>
#> $Petal.Width
#>             0% 25% 50% 75% 100%
#> setosa     0.1 0.2 0.2 0.3  0.6
#> versicolor 1.0 1.2 1.3 1.5  1.8
#> virginica  1.4 1.8 2.0 2.3  2.5
#>

## grouped data frame method (faster than dplyr only for small data)
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following object is masked from ‘package:testthat’:
#>
#>     matches
#> The following objects are masked from ‘package:stats’:
#>
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#>
#>     intersect, setdiff, setequal, union
giris <- group_by(iris, Species)

giris %>% BY(sum)                            # Compute sum
#> # A tibble: 3 x 5
#>   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
#> * <fct>             <dbl>       <dbl>        <dbl>       <dbl>
#> 1 setosa             250.        171.         73.1        12.3
#> 2 versicolor         297.        138.        213          66.3
#> 3 virginica          329.        149.        278.        101.
giris %>% BY(sum, use.g.names = TRUE,        # Use row.names and
             keep.group_vars = FALSE)        # remove 'Species' and groups attribute
#> # A tibble: 3 x 4
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> *        <dbl>       <dbl>        <dbl>       <dbl>
#> 1         250.        171.         73.1        12.3
#> 2         297.        138.        213          66.3
#> 3         329.        149.        278.        101.
giris %>% BY(sum, return = "matrix")         # Return matrix
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]        250.3       171.4         73.1        12.3
#> [2,]        296.8       138.5        213.0        66.3
#> [3,]        329.4       148.7        277.6       101.3
giris %>% BY(sum, return = "matrix",         # Matrix with row.names
             use.g.names = TRUE)
#>            Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa            250.3       171.4         73.1        12.3
#> versicolor        296.8       138.5        213.0        66.3
#> virginica         329.4       148.7        277.6       101.3
giris %>% BY(quantile)                       # Compute quantiles (output is stacked)
#> # A tibble: 15 x 4
#> # Groups:   Species [3]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width
#>           <dbl>       <dbl>        <dbl>       <dbl>
#>  1         4.3         2.3          1            0.1
#>  2         4.8         3.2          1.4          0.2
#>  3         5           3.4          1.5          0.2
#>  4         5.2         3.68         1.58         0.3
#>  5         5.8         4.4          1.9          0.6
#>  6         4.9         2            3            1
#>  7         5.6         2.52         4            1.2
#>  8         5.9         2.8          4.35         1.3
#>  9         6.3         3            4.6          1.5
#> 10         7           3.4          5.1          1.8
#> 11         4.9         2.2          4.5          1.4
#> 12         6.22        2.8          5.1          1.8
#> 13         6.5         3            5.55         2
#> 14         6.9         3.18         5.88         2.3
#> 15         7.9         3.8          6.9          2.5
giris %>% BY(quantile,                       # Much better, also keeps 'Species'
             expand.wide = TRUE)
#> # A tibble: 3 x 21
#>   Species `Sepal.Length.0~ `Sepal.Length.2~ `Sepal.Length.5~ `Sepal.Length.7~
#> * <fct>              <dbl>            <dbl>            <dbl>            <dbl>
#> 1 setosa               4.3             4.8               5                5.2
#> 2 versic~              4.9             5.6               5.9              6.3
#> 3 virgin~              4.9             6.22              6.5              6.9
#> # ... with 16 more variables: `Sepal.Length.100%` <dbl>,
#> #   `Sepal.Width.0%` <dbl>, `Sepal.Width.25%` <dbl>, `Sepal.Width.50%` <dbl>,
#> #   `Sepal.Width.75%` <dbl>, `Sepal.Width.100%` <dbl>, `Petal.Length.0%` <dbl>,
#> #   `Petal.Length.25%` <dbl>, `Petal.Length.50%` <dbl>,
#> #   `Petal.Length.75%` <dbl>, `Petal.Length.100%` <dbl>,
#> #   `Petal.Width.0%` <dbl>, `Petal.Width.25%` <dbl>, `Petal.Width.50%` <dbl>,
#> #   `Petal.Width.75%` <dbl>, `Petal.Width.100%` <dbl>