Fast (Grouped) Distinct Value Count for Matrix-Like Objects
fndistinct.Rd
fndistinct
is a generic function that (column-wise) computes the number of distinct values in x
, (optionally) grouped by g
. It is significantly faster than length(unique(x))
. The TRA
argument can further be used to transform x
using its (grouped) distinct value count.
Usage
fndistinct(x, ...)
# Default S3 method
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, nthreads = .op[["nthreads"]], ...)
# S3 method for class 'matrix'
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)
# S3 method for class 'data.frame'
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)
# S3 method for class 'grouped_df'
fndistinct(x, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, nthreads = .op[["nthreads"]], ...)
Arguments
- x
a vector, matrix, data frame or grouped data frame (class 'grouped_df').
- g
a factor,
GRP
object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to aGRP
object) used to groupx
.- TRA
an integer or quoted operator indicating the transformation to perform: 0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See
TRA
.- na.rm
logical.
TRUE
: Skip missing values inx
(faster computation).FALSE
: Also consider 'NA' as one distinct value.- use.g.names
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's.
- nthreads
integer. The number of threads to utilize. Parallelism is across groups for grouped computations and at the column-level otherwise.
- drop
matrix and data.frame method: Logical.
TRUE
drops dimensions and returns an atomic vector ifg = NULL
andTRA = NULL
.- keep.group_vars
grouped_df method: Logical.
FALSE
removes grouping variables after computation.- ...
arguments to be passed to or from other methods. If
TRA
is used, passingset = TRUE
will transform data by reference and return the result invisibly.
Details
fndistinct
implements a pretty fast C-level hashing algorithm inspired by the kit package to find the number of distinct values.
If na.rm = TRUE
(the default), missing values will be skipped yielding substantial performance gains in data with many missing values. If na.rm = FALSE
, missing values will simply be treated as any other value and read into the hash-map. Thus with the former, a numeric vector c(1.25,NaN,3.56,NA)
will have a distinct value count of 2, whereas the latter will return a distinct value count of 4.
fndistinct
preserves all attributes of non-classed vectors / columns, and only the 'label' attribute (if available) of classed vectors / columns (i.e. dates or factors). When applied to data frames and matrices, the row-names are adjusted as necessary.
Value
Integer. The number of distinct values in x
, grouped by g
, or (if TRA
is used) x
transformed by its distinct value count, grouped by g
.
Examples
## default vector method
fndistinct(airquality$Solar.R) # Simple distinct value count
#> [1] 117
fndistinct(airquality$Solar.R, airquality$Month) # Grouped distinct value count
#> 5 6 7 8 9
#> 27 28 29 27 27
## data.frame method
fndistinct(airquality)
#> Ozone Solar.R Wind Temp Month Day
#> 67 117 31 40 5 31
fndistinct(airquality, airquality$Month)
#> Ozone Solar.R Wind Temp Month Day
#> 5 21 27 18 18 1 31
#> 6 9 28 16 19 1 30
#> 7 24 29 17 14 1 31
#> 8 24 27 18 19 1 31
#> 9 21 27 19 20 1 30
fndistinct(wlddev) # Works with data of all types!
#> country iso3c date year decade region income OECD PCGDP LIFEEX
#> 216 216 61 61 7 7 4 2 9470 10548
#> GINI ODA POP
#> 368 7832 12877
head(fndistinct(wlddev, wlddev$iso3c))
#> country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA POP
#> ABW 1 1 61 61 7 1 1 1 32 60 0 20 60
#> AFG 1 1 61 61 7 1 1 1 18 60 0 60 60
#> AGO 1 1 61 61 7 1 1 1 40 59 3 58 60
#> ALB 1 1 61 61 7 1 1 1 40 59 9 32 60
#> AND 1 1 61 61 7 1 1 1 50 0 0 0 60
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
## matrix method
aqm <- qM(airquality)
fndistinct(aqm) # Also works for character or logical matrices
#> Ozone Solar.R Wind Temp Month Day
#> 67 117 31 40 5 31
fndistinct(aqm, airquality$Month)
#> Ozone Solar.R Wind Temp Month Day
#> 5 21 27 18 18 1 31
#> 6 9 28 16 19 1 30
#> 7 24 29 17 14 1 31
#> 8 24 27 18 19 1 31
#> 9 21 27 19 20 1 30
## method for grouped data frames - created with dplyr::group_by or fgroup_by
airquality |> fgroup_by(Month) |> fndistinct()
#> Month Ozone Solar.R Wind Temp Day
#> 1 5 21 27 18 18 31
#> 2 6 9 28 16 19 30
#> 3 7 24 29 17 14 31
#> 4 8 24 27 18 19 31
#> 5 9 21 27 19 20 30
wlddev |> fgroup_by(country) |>
fselect(PCGDP,LIFEEX,GINI,ODA) |> fndistinct()
#> country PCGDP LIFEEX GINI ODA
#> 1 Afghanistan 18 60 0 60
#> 2 Albania 40 59 9 32
#> 3 Algeria 60 60 3 60
#> 4 American Samoa 17 0 0 0
#> 5 Andorra 50 0 0 0
#> 6 Angola 40 59 3 58
#> 7 Antigua and Barbuda 43 60 0 47
#> 8 Argentina 60 60 29 60
#> 9 Armenia 30 59 20 29
#> 10 Aruba 32 60 0 20
#> 11 Australia 60 59 9 0
#> 12 Austria 60 60 16 0
#> 13 Azerbaijan 30 60 5 29
#> 14 Bahamas, The 60 59 0 41
#> [ reached 'max' / getOption("max.print") -- omitted 202 rows ]