fndistinct is a generic function that (column-wise) computes the number of distinct values in x, (optionally) grouped by g. It is significantly faster than length(unique(x)). The TRA argument can further be used to transform x using its (grouped) distinct value count.

fndistinct(x, ...)

# S3 method for default
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, nthreads = .op[["nthreads"]], ...)

# S3 method for matrix
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)

# S3 method for data.frame
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)

# S3 method for grouped_df
fndistinct(x, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = FALSE, keep.group_vars = TRUE, nthreads = .op[["nthreads"]], ...)

Arguments

x

a vector, matrix, data frame or grouped data frame (class 'grouped_df').

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

TRA

an integer or quoted operator indicating the transformation to perform: 0 - "replace_NA" | 1 - "replace_fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See TRA.

na.rm

logical. TRUE: Skip missing values in x (faster computation). FALSE: Also consider 'NA' as one distinct value.

use.g.names

logical. Generate group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.

nthreads

integer. The number of threads to utilize. Parallelism is across groups for grouped computations and at the column-level otherwise.

drop

matrix and data.frame method: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.

keep.group_vars

grouped_df method: Logical. FALSE removes grouping variables after computation.

...

arguments to be passed to or from other methods. If TRA is used, passing set = TRUE will transform data by reference and return the result invisibly.
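The TRA and drop arguments described above can be sketched on the built-in airquality data. This is a minimal illustration, assuming collapse is installed; the object names counts, replaced and one_row are hypothetical, and the note on missing values follows the TRA documentation rather than anything stated on this page.

```r
library(collapse)

# Grouped distinct counts of Solar.R by Month: one value per group
counts <- fndistinct(airquality$Solar.R, airquality$Month)

# TRA = "replace": return a vector as long as x, with each element replaced
# by the distinct value count of its group (per the TRA documentation,
# missing values in x are kept as missing)
replaced <- fndistinct(airquality$Solar.R, airquality$Month, TRA = "replace")
length(replaced) == nrow(airquality)

# drop = FALSE: keep the data frame structure instead of an atomic vector
one_row <- fndistinct(airquality, drop = FALSE)
is.data.frame(one_row)
```

The same pattern applies to the matrix method; the default method has no drop argument.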

Details

fndistinct implements a fast C-level hashing algorithm, inspired by the kit package, to find the number of distinct values.
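To relate the hashing approach to base R, the following sketch checks that fndistinct agrees with length(unique(x)) on a vector without missing values and times both calls. The test vector is a hypothetical example, and timings depend on the machine and the data, so no figures are shown.

```r
library(collapse)

set.seed(1)
# Hypothetical test vector: one million draws from 10,000 integers, no NAs
x <- sample.int(1e4, 1e6, replace = TRUE)

n_fast <- fndistinct(x)        # hash-based count
n_base <- length(unique(x))    # base R equivalent when x has no NAs
n_fast == n_base

# Rough timing comparison (results vary by system)
system.time(fndistinct(x))
system.time(length(unique(x)))
```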

If na.rm = TRUE (the default), missing values are skipped, which yields substantial performance gains in data with many missing values. If na.rm = FALSE, missing values are treated like any other value and read into the hash map. For example, the numeric vector c(1.25,NaN,3.56,NA) has a distinct value count of 2 with na.rm = TRUE and a distinct value count of 4 with na.rm = FALSE.
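The na.rm behaviour described above can be checked directly on the vector from the text. This is a small sketch, assuming collapse is installed; the expected counts in the comments are the ones stated in the paragraph above.

```r
library(collapse)

x <- c(1.25, NaN, 3.56, NA)

# Missing values skipped: only 1.25 and 3.56 are counted
n_skip <- fndistinct(x, na.rm = TRUE)    # 2, per the description above

# NaN and NA are each read into the hash map as a distinct value
n_keep <- fndistinct(x, na.rm = FALSE)   # 4, per the description above
```

Passing na.rm explicitly, as here, makes the result independent of the global default stored in .op[["na.rm"]].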

fndistinct preserves all attributes of non-classed vectors / columns, and only the 'label' attribute (if available) of classed vectors / columns (e.g. dates or factors). When applied to data frames and matrices, the row-names are adjusted as necessary.

Value

Integer. The number of distinct values in x, grouped by g, or (if TRA is used) x transformed by its distinct value count, grouped by g.

Examples

## default vector method
fndistinct(airquality$Solar.R)                   # Simple distinct value count
#> [1] 117
fndistinct(airquality$Solar.R, airquality$Month) # Grouped distinct value count
#>  5  6  7  8  9 
#> 27 28 29 27 27 

## data.frame method
fndistinct(airquality)
#>   Ozone Solar.R    Wind    Temp   Month     Day 
#>      67     117      31      40       5      31 
fndistinct(airquality, airquality$Month)
#>   Ozone Solar.R Wind Temp Month Day
#> 5    21      27   18   18     1  31
#> 6     9      28   16   19     1  30
#> 7    24      29   17   14     1  31
#> 8    24      27   18   19     1  31
#> 9    21      27   19   20     1  30
fndistinct(wlddev)                               # Works with data of all types!
#> country   iso3c    date    year  decade  region  income    OECD   PCGDP  LIFEEX 
#>     216     216      61      61       7       7       4       2    9470   10548 
#>    GINI     ODA     POP 
#>     368    7832   12877 
head(fndistinct(wlddev, wlddev$iso3c))
#>     country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA POP
#> ABW       1     1   61   61      7      1      1    1    32     60    0  20  60
#> AFG       1     1   61   61      7      1      1    1    18     60    0  60  60
#> AGO       1     1   61   61      7      1      1    1    40     59    3  58  60
#> ALB       1     1   61   61      7      1      1    1    40     59    9  32  60
#> AND       1     1   61   61      7      1      1    1    50      0    0   0  60
#>  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

## matrix method
aqm <- qM(airquality)
fndistinct(aqm)                                  # Also works for character or logical matrices
#>   Ozone Solar.R    Wind    Temp   Month     Day 
#>      67     117      31      40       5      31 
fndistinct(aqm, airquality$Month)
#>   Ozone Solar.R Wind Temp Month Day
#> 5    21      27   18   18     1  31
#> 6     9      28   16   19     1  30
#> 7    24      29   17   14     1  31
#> 8    24      27   18   19     1  31
#> 9    21      27   19   20     1  30
 
## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
airquality %>% group_by(Month) %>% fndistinct()
#> # A tibble: 5 × 6
#>   Month Ozone Solar.R  Wind  Temp   Day
#>   <int> <int>   <int> <int> <int> <int>
#> 1     5    21      27    18    18    31
#> 2     6     9      28    16    19    30
#> 3     7    24      29    17    14    31
#> 4     8    24      27    18    19    31
#> 5     9    21      27    19    20    30
wlddev %>% group_by(country) %>%
             select(PCGDP,LIFEEX,GINI,ODA) %>% fndistinct()
#> Adding missing grouping variables: `country`
#> # A tibble: 216 × 5
#>    country             PCGDP LIFEEX  GINI   ODA
#>    <chr>               <int>  <int> <int> <int>
#>  1 Afghanistan            18     60     0    60
#>  2 Albania                40     59     9    32
#>  3 Algeria                60     60     3    60
#>  4 American Samoa         17      0     0     0
#>  5 Andorra                50      0     0     0
#>  6 Angola                 40     59     3    58
#>  7 Antigua and Barbuda    43     60     0    47
#>  8 Argentina              60     60    29    60
#>  9 Armenia                30     59    20    29
#> 10 Aruba                  32     60     0    20
#> # … with 206 more rows