Fast (Weighted) Sample Quantiles and Range

A faster alternative to quantile (written fully in C), that supports sampling weights, and can also quickly compute quantiles from an ordering vector (e.g. order(x)). frange provides a fast alternative to range.

Usage

fquantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), w = NULL,
          o = if(length(x) > 1e5L && length(probs) > log(length(x)))
              radixorder(x) else NULL,
          na.rm = .op[["na.rm"]], type = 7L, names = TRUE,
          check.o = is.null(attr(o, "sorted")))

# Programmers version: no names, intelligent defaults, or checks
.quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), w = NULL, o = NULL,
          na.rm = TRUE, type = 7L, names = FALSE, check.o = FALSE)

# Fast range (min and max)
frange(x, na.rm = .op[["na.rm"]], finite = FALSE)
.range(x, na.rm = TRUE, finite = FALSE)

Arguments

x: a numeric or integer vector.
probs: numeric vector of probabilities with values in [0,1].
w: a numeric vector of sampling weights. Missing weights are only supported if x is also missing.
o: integer. An vector giving the ordering of the elements in x, such that identical(x[o], sort(x)). If available this considerably speeds up the estimation.
na.rm: logical. Remove missing values, default TRUE.
finite: logical. Omit all non-finite values.
type: integer. Quantile types 5-9. See quantile. Further details are provided in Hyndman and Fan (1996) who recommended type 8. The default method is type 7.
names: logical. Generates names of the form paste0(round(probs * 100, 1), "%") (in C). Set to FALSE for speedup.
check.o: logical. If o is supplied, TRUE runs through o once and checks that it is valid, i.e. that each element is in [1, length(x)]. Set to FALSE for significant speedup if o is known to be valid.

Details

fquantile is implemented using a quickselect algorithm in C, inspired by data.table's gmedian. The algorithm is applied incrementally to different sections of the array to find individual quantiles. If many quantile probabilities are requested, sorting the whole array with the fast radixorder algorithm is more efficient. The default threshold for this (length(x) > 1e5L && length(probs) > log(length(x))) is conservative, given that quickselect is generally more efficient on longitudinal data with similar values repeated by groups. With random data, my investigations yield that a threshold of length(probs) > log10(length(x)) would be more appropriate.

Weighted quantile estimation, in a nutshell, is done by internally calling radixorder(x) (unless o is supplied), and summing the weights in order until the lowest required order statistic j is found, which corresponds to exceeding a target sum of weights that is a function of the probability p, the quantile method (see quantile), the total sum of weights, and the smallest (non-zero) weight. For quantile type 7 the target sum is sumwp = (sum(w) - min(w)) * p (resembling (n - 1) * p in the unweighted case). Then, a continuous index h in [0, 1] is determined as one minus the difference between the sum of weights associated with j and the target sum, divided by the weight of element j, that is h = 1 - (sumwj - sumwp) / w[j]. A weighted quantile can then be computed as a weighted average of 2 order statistics, exactly as in the unweighted case: WQ[i](p) = (1 - h) x[j] + h x[j+1]. If the order statistic j+1 has a zero weight, j+2 is taken (or j+3 if j+2 also has zero weight etc..). The Examples section provides a demonstration in R that is roughly equivalent to the algorithm just outlined.

frange is considerably more efficient than range, which calls both min and max, and thus requires 2 full passes instead of 1 required by frange. If only probabilities 0 and 1 are requested, fquantile internally calls frange.

Value

A vector of quantiles. If names = TRUE, fquantile generates names as paste0(round(probs * 100, 1), "%") (in C).

Examples

frange(mtcars$mpg)
#> [1] 10.4 33.9

## Checking computational equivalence to stats::quantile()
w = alloc(abs(rnorm(1)), 32)
o = radixorder(mtcars$mpg)
for (i in 5:9) print(all_obj_equal(fquantile(mtcars$mpg, type = i),
                                   fquantile(mtcars$mpg, type = i, w = w),
                                   fquantile(mtcars$mpg, type = i, o = o),
                                   fquantile(mtcars$mpg, type = i, w = w, o = o),
                                    quantile(mtcars$mpg, type = i)))
#> [1] TRUE
#> [1] TRUE
#> [1] TRUE
#> [1] TRUE
#> [1] TRUE

## Demonstaration: weighted quantiles type 7 in R
wquantile7R <- function(x, w, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = TRUE) {
  if(na.rm && anyNA(x)) {             # Removing missing values (only in x)
    cc = whichNA(x, invert = TRUE)    # The C code first calls radixorder(x), which places
    x = x[cc]; w = w[cc]              # missing values last, so removing = early termination
  }
  if(anyv(w, 0)) {                    # Removing zero weights
    nzw = whichv(w, 0, invert = TRUE) # In C, skipping zero weight order statistics is built
    x = x[nzw]; w = w[nzw]            # into the quantile algorithm, as outlined above
  }
  o = radixorder(x)                   # Ordering
  wo = w[o]
  w_cs = cumsum(wo)                   # Cumulative sum
  sumwp = sum(w)                      # Computing sum(w) - min(w)
  sumwp = sumwp - wo[1L]
  sumwp = sumwp * probs               # Target sums of weights for quantile type 7
  res = sapply(sumwp, function(tsump) {
    j = which.max(w_cs > tsump)           # Lower order statistic
    hl = (w_cs[j] - tsump) / wo[j]        # Index weight of x[j]  (h = 1 - hl)
    hl * x[o[j]] + (1 - hl) * x[o[j+1L]]  # Weighted quantile
  })
  if(names) names(res) = paste0(as.integer(probs * 100), "%")
  res
} # Note: doesn't work for min and max. Overall the C code is significantly more rigorous.

wquantile7R(mtcars$mpg, mtcars$wt)
#>      25%      50%      75% 
#> 15.05583 17.81282 21.21670 

all.equal(wquantile7R(mtcars$mpg, mtcars$wt),
          fquantile(mtcars$mpg, c(0.25, 0.5, 0.75), mtcars$wt))
#> [1] TRUE

## Efficient grouped quantile estimation: use .quantile for less call overhead
BY(mtcars$mpg, mtcars$cyl, .quantile, names = TRUE, expand.wide = TRUE)
#>     0%   25%  50%   75% 100%
#> 4 21.4 22.80 26.0 30.40 33.9
#> 6 17.8 18.65 19.7 21.00 21.4
#> 8 10.4 14.40 15.2 16.25 19.2
BY(mtcars, mtcars$cyl, .quantile, names = TRUE)
#>         mpg cyl   disp    hp  drat     wt  qsec vs  am gear carb
#> 4.0%   21.4   4  71.10  52.0 3.690 1.5130 16.70  0 0.0    3    1
#> 4.25%  22.8   4  78.85  65.5 3.810 1.8850 18.56  1 0.5    4    1
#> 4.50%  26.0   4 108.00  91.0 4.080 2.2000 18.90  1 1.0    4    2
#> 4.75%  30.4   4 120.65  96.0 4.165 2.6225 19.95  1 1.0    4    2
#> 4.100% 33.9   4 146.70 113.0 4.930 3.1900 22.90  1 1.0    5    2
#> 6.0%   17.8   6 145.00 105.0 2.760 2.6200 15.50  0 0.0    3    1
#>  [ reached 'max' / getOption("max.print") -- omitted 9 rows ]
library(magrittr)
mtcars |> fgroup_by(cyl) |> BY(.quantile)
#>   cyl  mpg   disp    hp  drat     wt  qsec vs  am gear carb
#> 1   4 21.4  71.10  52.0 3.690 1.5130 16.70  0 0.0    3    1
#> 2   4 22.8  78.85  65.5 3.810 1.8850 18.56  1 0.5    4    1
#> 3   4 26.0 108.00  91.0 4.080 2.2000 18.90  1 1.0    4    2
#> 4   4 30.4 120.65  96.0 4.165 2.6225 19.95  1 1.0    4    2
#> 5   4 33.9 146.70 113.0 4.930 3.1900 22.90  1 1.0    5    2
#> 6   6 17.8 145.00 105.0 2.760 2.6200 15.50  0 0.0    3    1
#>  [ reached 'max' / getOption("max.print") -- omitted 9 rows ]

## With weights
BY(mtcars$mpg, mtcars$cyl, .quantile, w = mtcars$wt, names = TRUE, expand.wide = TRUE)
#>     0%      25%      50%      75% 100%
#> 4 21.4 22.80000 24.63398 28.46510 33.9
#> 6 17.8 18.46720 19.53285 21.00000 21.4
#> 8 10.4 13.82363 15.10871 15.86062 19.2
BY(mtcars, mtcars$cyl, .quantile, w = mtcars$wt, names = TRUE)
#>             mpg cyl      disp        hp     drat       wt     qsec vs am gear
#> 4.0%   21.40000   4  71.10000  52.00000 3.690000 1.513000 16.70000  0  0    3
#> 4.25%  22.80000   4  80.47271  65.58692 3.765265 2.035063 18.60174  1  0    4
#> 4.50%  24.63398   4 120.11915  91.92430 3.995606 2.356062 19.23390  1  1    4
#> 4.75%  28.46510   4 131.38432  96.53079 4.105221 3.006192 20.00287  1  1    4
#> 4.100% 33.90000   4 146.70000 113.00000 4.930000 3.190000 22.90000  1  1    5
#> 6.0%   17.80000   6 145.00000 105.00000 2.760000 2.620000 15.50000  0  0    3
#>        carb
#> 4.0%      1
#> 4.25%     1
#> 4.50%     2
#> 4.75%     2
#> 4.100%    2
#> 6.0%      1
#>  [ reached 'max' / getOption("max.print") -- omitted 9 rows ]
mtcars |> fgroup_by(cyl) |> fselect(-wt) |> BY(.quantile, w = mtcars$wt)
#>   cyl      mpg      disp        hp     drat     qsec vs am     gear     carb
#> 1   4 21.40000  71.10000  52.00000 3.690000 16.70000  0  0 3.000000 1.000000
#> 2   4 22.80000  80.47271  65.58692 3.765265 18.60174  1  0 4.000000 1.000000
#> 3   4 24.63398 120.11915  91.92430 3.995606 19.23390  1  1 4.000000 2.000000
#> 4   4 28.46510 131.38432  96.53079 4.105221 20.00287  1  1 4.000000 2.000000
#> 5   4 33.90000 146.70000 113.00000 4.930000 22.90000  1  1 5.000000 2.000000
#> 6   6 17.80000 145.00000 105.00000 2.760000 15.50000  0  0 3.000000 1.000000
#> 7   6 18.46720 160.00000 110.00000 3.269798 16.88588  0  0 3.415101 2.245303
#>  [ reached 'max' / getOption("max.print") -- omitted 8 rows ]
mtcars |> fgroup_by(cyl) |> fsummarise(across(-wt, .quantile, w = wt))
#>   cyl      mpg      disp        hp     drat     qsec vs am     gear     carb
#> 1   4 21.40000  71.10000  52.00000 3.690000 16.70000  0  0 3.000000 1.000000
#> 2   4 22.80000  80.47271  65.58692 3.765265 18.60174  1  0 4.000000 1.000000
#> 3   4 24.63398 120.11915  91.92430 3.995606 19.23390  1  1 4.000000 2.000000
#> 4   4 28.46510 131.38432  96.53079 4.105221 20.00287  1  1 4.000000 2.000000
#> 5   4 33.90000 146.70000 113.00000 4.930000 22.90000  1  1 5.000000 2.000000
#> 6   6 17.80000 145.00000 105.00000 2.760000 15.50000  0  0 3.000000 1.000000
#> 7   6 18.46720 160.00000 110.00000 3.269798 16.88588  0  0 3.415101 2.245303
#>  [ reached 'max' / getOption("max.print") -- omitted 8 rows ]