fvar and fsd are generic functions that compute the (column-wise) variance and standard deviation of x, (optionally) grouped by g and/or frequency-weighted by w. The TRA argument can further be used to transform x using its (grouped, weighted) variance/sd.

fvar(x, ...)
fsd(x, ...)

# S3 method for default
fvar(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, stable.algo = TRUE, ...)
# S3 method for default
fsd(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
    use.g.names = TRUE, stable.algo = TRUE, ...)

# S3 method for matrix
fvar(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, drop = TRUE, stable.algo = TRUE, ...)
# S3 method for matrix
fsd(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
    use.g.names = TRUE, drop = TRUE, stable.algo = TRUE, ...)

# S3 method for data.frame
fvar(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, drop = TRUE, stable.algo = TRUE, ...)
# S3 method for data.frame
fsd(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
    use.g.names = TRUE, drop = TRUE, stable.algo = TRUE, ...)

# S3 method for grouped_df
fvar(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE,
     stable.algo = TRUE, ...)
# S3 method for grouped_df
fsd(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
    use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE,
    stable.algo = TRUE, ...)

Arguments

x

a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df').

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

w

a numeric vector of (non-negative) weights, may contain missing values.

TRA

an integer or quoted operator indicating the transformation to perform: 0 - "replace_NA" | 1 - "replace_fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See TRA.

na.rm

logical. Skip missing values in x. Defaults to TRUE and implemented at very little computational cost. If na.rm = FALSE, NA is returned whenever a missing value is encountered.

use.g.names

logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.tables.

drop

matrix and data.frame method: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.

keep.group_vars

grouped_df method: Logical. FALSE removes grouping variables after computation.

keep.w

grouped_df method: Logical. Retain summed weighting variable after computation (if contained in grouped_df).

stable.algo

logical. TRUE (default) uses Welford's numerically stable online algorithm. FALSE uses a faster but numerically unstable one-pass method. See Details.

...

arguments to be passed to or from other methods. If TRA is used, passing set = TRUE will transform data by reference and return the result invisibly.
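
As a rough illustration of what TRA = "/" does, the following base-R sketch reproduces the scaling without calling fsd. It is only an equivalence check under the assumption that fsd(x, TRA = "/") divides each value by its (group) standard deviation; it says nothing about how the package implements this internally.

```r
# Ungrouped: divide every value by the overall standard deviation.
scaled <- mtcars$mpg / sd(mtcars$mpg)

# Grouped: ave() returns each observation's group sd, aligned with x.
group_sd <- ave(mtcars$mpg, mtcars$cyl, FUN = sd)
gscaled  <- mtcars$mpg / group_sd

head(scaled, 3)
head(gscaled, 3)
```

The leading values can be compared with the output of fsd(mtcars$mpg, TRA = "/") and fsd(mtcars$mpg, mtcars$cyl, TRA = "/") in the Examples.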

Details

Welford's online algorithm, used by default to compute the variance, is well described in the Wikipedia article on algorithms for calculating variance (its section Weighted incremental algorithm also shows how the weighted variance is obtained by this algorithm). See also Welford (1962) in References.
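
For orientation, the unweighted update can be sketched in plain R as below. The helper name welford_var is hypothetical and not part of the package; the package's compiled implementation will differ in detail (grouping, weights, missing values).

```r
# Hypothetical helper: Welford's online algorithm for the sample variance.
welford_var <- function(x) {
  n <- 0       # number of observations seen so far
  mean_x <- 0  # running mean
  m2 <- 0      # running sum of squared deviations from the running mean
  for (xi in x) {
    n <- n + 1
    delta <- xi - mean_x
    mean_x <- mean_x + delta / n
    m2 <- m2 + delta * (xi - mean_x)  # uses the updated mean
  }
  if (n < 2) return(NA_real_)
  m2 / (n - 1)                        # Bessel's correction
}

all.equal(welford_var(mtcars$mpg), var(mtcars$mpg))
```

Because each step only adds a product of two deviations, no large intermediate sums are subtracted from each other, which is what makes the method stable.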

If stable.algo = FALSE, the variance is computed in one pass as (sum(x^2)-n*mean(x)^2)/(n-1), where sum(x^2) is the sum of squares from which the expected sum of squares n*mean(x)^2 is subtracted, normalized by n-1 (Bessel's correction). This is numerically unstable if sum(x^2) and n*mean(x)^2 are large numbers very close together, which will be the case for large n, large x-values and small variances (catastrophic cancellation occurs, leading to a loss of numeric precision). Numeric precision is however still maximized through the internal use of long doubles in C++, and the fast algorithm can be up to 4 times faster than Welford's method.
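
The cancellation problem can be seen with a small base-R sketch of the one-pass formula. The helper one_pass_var is hypothetical and works in ordinary double precision, so it will lose precision sooner than the package's long-double implementation; it is meant only to show the mechanism.

```r
# Hypothetical helper: the one-pass formula from the text, in plain R.
one_pass_var <- function(x) {
  n <- length(x)
  (sum(x^2) - n * mean(x)^2) / (n - 1)
}

x <- mtcars$mpg
all.equal(one_pass_var(x), var(x))  # agrees on well-scaled data

# Adding a large constant leaves the true variance unchanged, but makes
# sum(x^2) and n * mean(x)^2 huge and nearly equal.
x_big <- x + 1e9
var(x_big)           # two-pass computation in base R, still close to var(x)
one_pass_var(x_big)  # the difference of two huge numbers is no longer reliable
```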

The weighted variance is computed with frequency weights as (sum(x^2*w)-sum(w)*weighted.mean(x,w)^2)/(sum(w)-1). If na.rm = TRUE, missing values will be removed from both x and w, i.e. only x[complete.cases(x,w)] and w[complete.cases(x,w)] are used.
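
A minimal base-R sketch of this formula follows. The helper freq_wvar is hypothetical; it mirrors the stated formula and missing-value rule rather than the package's internal code.

```r
# Hypothetical helper: frequency-weighted variance as given in the text.
freq_wvar <- function(x, w, na.rm = TRUE) {
  if (na.rm) {
    ok <- complete.cases(x, w)  # keep pairs where both x and w are observed
    x <- x[ok]
    w <- w[ok]
  }
  sw <- sum(w)
  (sum(x^2 * w) - sw * weighted.mean(x, w)^2) / (sw - 1)
}

x <- c(2, 4, 4, NA, 7)
w <- c(1, 3, 2, 5, NA)
freq_wvar(x, w)

# With integer weights this equals the variance of the data with each
# observation repeated w times.
ok <- complete.cases(x, w)
all.equal(freq_wvar(x, w), var(rep(x[ok], w[ok])))
```

The comparison with rep() is what "frequency weights" means here: a weight of 3 is treated as three identical observations, which is why the denominator is sum(w)-1.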

For further computational detail see fsum.

Value

fvar returns the (w weighted) variance of x, grouped by g, or (if TRA is used) x transformed by its (grouped, weighted) variance. fsd computes the standard deviation of x in like manner.

References

Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics. 4 (3): 419-420. doi:10.2307/1266577.

Examples

## default vector method
fvar(mtcars$mpg)                            # Simple variance (all fsd examples below also hold for fvar)
#> [1] 36.3241
fsd(mtcars$mpg)                             # Simple standard deviation
#> [1] 6.026948
fsd(mtcars$mpg, w = mtcars$hp)              # Weighted sd: Weighted by hp
#> [1] 5.150858
fsd(mtcars$mpg, TRA = "/")                  # Simple transformation: scaling (See also ?fscale)
#>  [1] 3.484351 3.484351 3.783009 3.550719 3.102731 3.003178 2.372677 4.048484
#>  [9] 3.783009 3.185692 2.953402 2.721112 2.870441 2.522006 1.725583 1.725583
#> [17] 2.439045 5.375855 5.044012 5.624737 3.567311 2.571783 2.522006 2.206755
#> [25] 3.185692 4.529656 4.313958 5.044012 2.621559 3.268653 2.488822 3.550719
fsd(mtcars$mpg, mtcars$cyl)                 # Grouped sd
#>        4        6        8 
#> 4.509828 1.453567 2.560048 
fsd(mtcars$mpg, mtcars$cyl, mtcars$hp)      # Grouped weighted sd
#>        4        6        8 
#> 4.250863 1.294689 2.390448 
fsd(mtcars$mpg, mtcars$cyl, TRA = "/")      # Scaling by group
#>  [1] 14.447218 14.447218  5.055626 14.722403  7.304550 12.452126  5.585833
#>  [8]  5.410406  5.055626 13.208885 12.245737  6.406130  6.757686  5.937388
#> [15]  4.062424  4.062424  5.742080  7.184310  6.740834  7.516917  4.767366
#> [22]  6.054574  5.937388  5.195215  7.499859  6.053446  5.765187  6.740834
#> [29]  6.171759 13.552866  5.859265  4.745192
fsd(mtcars$mpg, mtcars$cyl, mtcars$hp, "/") # Group-scaling using weighted group sds
#>  [1] 16.220111 16.220111  5.363617 16.529066  7.822800 13.980191  5.982141
#>  [8]  5.740011  5.363617 14.829816 13.748475  6.860638  7.237136  6.358640
#> [15]  4.350648  4.350648  6.149474  7.621982  7.151489  7.974852  5.057797
#> [22]  6.484139  6.358640  5.563810  8.031966  6.422226  6.116405  7.151489
#> [29]  6.609639 15.216009  6.274973  5.034272

## data.frame method
fsd(iris)                           # This works, although 'Species' is a factor variable
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#>    0.8280661    0.4358663    1.7652982    0.7622377    0.8192319 
fsd(mtcars, drop = FALSE)           # This works, all columns are numeric variables
#>        mpg      cyl     disp       hp      drat        wt     qsec        vs
#> 1 6.026948 1.785922 123.9387 68.56287 0.5346787 0.9784574 1.786943 0.5040161
#>          am      gear   carb
#> 1 0.4989909 0.7378041 1.6152
fsd(iris[-5], iris[5])              # By Species: iris[5] is still a list, and thus passed to GRP()
#>            Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa        0.3524897   0.3790644    0.1736640   0.1053856
#> versicolor    0.5161711   0.3137983    0.4699110   0.1977527
#> virginica     0.6358796   0.3224966    0.5518947   0.2746501
fsd(iris[-5], iris[[5]])            # Same thing much faster: fsd recognizes 'Species' is a factor
#>            Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa        0.3524897   0.3790644    0.1736640   0.1053856
#> versicolor    0.5161711   0.3137983    0.4699110   0.1977527
#> virginica     0.6358796   0.3224966    0.5518947   0.2746501
head(fsd(iris[-5], iris[[5]], TRA = "/")) # Data scaled by species (see also fscale)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     14.46851    9.233260     8.061544    1.897793
#> 2     13.90112    7.914223     8.061544    1.897793
#> 3     13.33372    8.441838     7.485720    1.897793
#> 4     13.05003    8.178031     8.637369    1.897793
#> 5     14.18481    9.497068     8.061544    1.897793
#> 6     15.31960   10.288490     9.789018    3.795585

## matrix method
m <- qM(mtcars)
fsd(m)
#>         mpg         cyl        disp          hp        drat          wt 
#>   6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574 
#>        qsec          vs          am        gear        carb 
#>   1.7869432   0.5040161   0.4989909   0.7378041   1.6152000 
fsd(m, mtcars$cyl) # etc..
#>        mpg cyl     disp       hp      drat        wt     qsec        vs
#> 4 4.509828   0 26.87159 20.93453 0.3654711 0.5695637 1.682445 0.3015113
#> 6 1.453567   0 41.56246 24.26049 0.4760552 0.3563455 1.706866 0.5345225
#> 8 2.560048   0 67.77132 50.97689 0.3723618 0.7594047 1.196014 0.0000000
#>          am      gear     carb
#> 4 0.4670994 0.5393599 0.522233
#> 6 0.5345225 0.6900656 1.812654
#> 8 0.3631365 0.7262730 1.556624
 
## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
mtcars %>% group_by(cyl,vs,am) %>% fsd()
#> # A tibble: 7 × 11
#>     cyl    vs    am    mpg  disp    hp   drat     wt    qsec   gear   carb
#>   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
#> 1     4     0     1 NA     NA    NA    NA     NA     NA      NA     NA    
#> 2     4     1     0  1.45  14.0  19.7   0.13   0.408  1.67    0.577  0.577
#> 3     4     1     1  4.76  18.8  24.1   0.378  0.440  0.945   0.378  0.535
#> 4     6     0     1  0.751  8.66 37.5   0.162  0.128  0.769   0.577  1.15 
#> 5     6     1     0  1.63  44.7   9.18  0.592  0.116  0.816   0.577  1.73 
#> 6     8     0     0  2.77  71.8  33.4   0.230  0.768  0.802   0      0.900
#> 7     8     0     1  0.566 35.4  50.2   0.481  0.283  0.0707  0      2.83 
mtcars %>% group_by(cyl,vs,am) %>% fsd(keep.group_vars = FALSE) # Remove grouping columns
#> # A tibble: 7 × 8
#>      mpg  disp    hp   drat     wt    qsec   gear   carb
#>    <dbl> <dbl> <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
#> 1 NA     NA    NA    NA     NA     NA      NA     NA    
#> 2  1.45  14.0  19.7   0.13   0.408  1.67    0.577  0.577
#> 3  4.76  18.8  24.1   0.378  0.440  0.945   0.378  0.535
#> 4  0.751  8.66 37.5   0.162  0.128  0.769   0.577  1.15 
#> 5  1.63  44.7   9.18  0.592  0.116  0.816   0.577  1.73 
#> 6  2.77  71.8  33.4   0.230  0.768  0.802   0      0.900
#> 7  0.566 35.4  50.2   0.481  0.283  0.0707  0      2.83 
mtcars %>% group_by(cyl,vs,am) %>% fsd(hp)      # Weighted by hp
#> # A tibble: 7 × 11
#>     cyl    vs    am sum.hp   mpg  disp  drat     wt   qsec  gear  carb
#>   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>
#> 1     4     0     1     91 0      0    0     0      0      0     0    
#> 2     4     1     0    254 1.12  11.4  0.109 0.342  1.40   0.487 0.487
#> 3     4     1     1    564 4.56  17.9  0.312 0.447  0.934  0.401 0.500
#> 4     6     0     1    395 0.647  7.46 0.139 0.0959 0.651  0.497 0.995
#> 5     6     1     0    461 1.40  38.8  0.509 0.0989 0.701  0.499 1.50 
#> 6     8     0     0   2330 2.66  68.8  0.233 0.756  0.830  0     0.851
#> 7     8     0     1    599 0.398 24.8  0.338 0.199  0.0497 0     1.99 
mtcars %>% group_by(cyl,vs,am) %>% fsd(hp, "/") # Weighted scaling transformation
#> # A tibble: 32 × 11
#> # Groups:   cyl, vs, am [7]
#>      cyl    vs    am    hp   mpg  disp  drat    wt  qsec   gear  carb
#>  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
#>  1     6     0     1   110 32.5  21.4  28.0  27.3   25.3   8.04 4.02 
#>  2     6     0     1   110 32.5  21.4  28.0  30.0   26.1   8.04 4.02 
#>  3     4     1     1    93  5.00  6.05 12.3   5.19  19.9   9.98 2.00 
#>  4     6     1     0   110 15.3   6.65  6.05 32.5   27.7   6.01 0.667
#>  5     8     0     0   175  7.02  5.23 13.5   4.55  20.5 Inf    2.35 
#>  6     6     1     0   105 13.0   5.80  5.42 35.0   28.9   6.01 0.667
#>  7     8     0     0   245  5.37  5.23 13.8   4.73  19.1 Inf    4.70 
#>  8     4     1     0    62 21.7  12.8  34.0   9.34  14.3   8.22 4.11 
#>  9     4     1     0    95 20.3  12.3  36.1   9.22  16.3   8.22 4.11 
#> 10     6     1     0   123 13.8   4.32  7.69 34.8   26.1   8.01 2.67 
#> # … with 22 more rows