fNdistinct is a generic function that (column-wise) computes the number of distinct values in x, (optionally) grouped by g. It is significantly faster than length(unique(x)). The TRA argument can further be used to transform x using its (grouped) distinct value count.

fNdistinct(x, ...)

# S3 method for default
fNdistinct(x, g = NULL, TRA = NULL, na.rm = TRUE,
           use.g.names = TRUE, ...)

# S3 method for matrix
fNdistinct(x, g = NULL, TRA = NULL, na.rm = TRUE,
           use.g.names = TRUE, drop = TRUE, ...)

# S3 method for data.frame
fNdistinct(x, g = NULL, TRA = NULL, na.rm = TRUE,
           use.g.names = TRUE, drop = TRUE, ...)

# S3 method for grouped_df
fNdistinct(x, TRA = NULL, na.rm = TRUE,
           use.g.names = FALSE, keep.group_vars = TRUE, ...)

Arguments

x

a vector, matrix, data frame or grouped data frame (class 'grouped_df').

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

TRA

an integer or quoted operator indicating the transformation to perform: 1 - "replace_fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See TRA.

na.rm

logical. TRUE: Skip missing values in x (faster computation). FALSE: Also consider 'NA' as one distinct value.

use.g.names

logical. Make group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.

drop

matrix and data.frame method: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.

keep.group_vars

grouped_df method: Logical. FALSE removes grouping variables after computation.

...

arguments to be passed to or from other methods.

Details

fNdistinct implements a fast algorithm to find the number of distinct values, utilizing index hashing implemented in the Rcpp::sugar::IndexHash class.

If na.rm = TRUE (the default), missing values are skipped, yielding substantial performance gains in data with many missing values. If na.rm = FALSE, missing values are treated like any other value and read into the hash-map. Thus with the former, a numeric vector c(1.25,NaN,3.56,NA) has a distinct value count of 2, whereas the latter returns a distinct value count of 4.
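The two na.rm settings can be mimicked in base R, which may help when checking expectations. This is only an illustrative sketch built on length(unique()), not the hashing code fNdistinct uses; the helper name n_distinct_ref is hypothetical and not part of the package.

```r
# Hypothetical base-R reference helper (not part of collapse)
n_distinct_ref <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]  # is.na() is TRUE for both NA and NaN
  length(unique(x))             # unique() keeps NA and NaN as separate values
}

x <- c(1.25, NaN, 3.56, NA)
n_distinct_ref(x)                 # 2: only 1.25 and 3.56 are counted
n_distinct_ref(x, na.rm = FALSE)  # 4: NaN and NA each count as one value
```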

Grouped computations are performed by mapping the data to a sparse-array and then hash-mapping each group. This is often not much slower than using a larger hash-map for the entire data when g = NULL.
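For comparison, a grouped distinct count can be written in base R with tapply. The sketch below is a slow reference version for illustration only; it does not reproduce the sparse-array and hash-map approach described above, and the comparison with fNdistinct runs only if collapse is installed.

```r
# Base-R reference for a grouped distinct count (illustration only)
x <- airquality$Solar.R
g <- airquality$Month

# Count distinct non-missing values of x within each level of g
ref <- tapply(x, g, function(v) length(unique(v[!is.na(v)])))
ref

# Compare with fNdistinct when collapse is available
if (requireNamespace("collapse", quietly = TRUE)) {
  fast <- collapse::fNdistinct(x, g)
  print(all(fast == ref))
}
```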

fNdistinct preserves all attributes of non-classed vectors / columns, and only the 'label' attribute (if available) of classed vectors / columns (e.g. dates or factors). When applied to data frames and matrices, the row-names are adjusted as necessary.

Value

Integer. The number of distinct values in x, grouped by g, or (if TRA is used) x transformed by its distinct value count, grouped by g.
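A short sketch of the transformed form of the return value, assuming collapse is installed and attached. It uses only the arguments listed above; the variable name n_by_group is illustrative.

```r
library(collapse)

x <- airquality$Solar.R
g <- airquality$Month

# Replace every element of x by the distinct value count of its group.
# With "replace_fill", missing values in x are overwritten as well;
# "replace" would instead keep them missing.
n_by_group <- fNdistinct(x, g, TRA = "replace_fill")

length(n_by_group) == length(x)  # the transformed result has the length of x
```

Without TRA the same call returns one count per group, so TRA is the switch between an aggregated and a same-length result.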

See also

Examples

## default vector method
fNdistinct(airquality$Solar.R)                   # Simple distinct value count
#> [1] 117
fNdistinct(airquality$Solar.R, airquality$Month) # Grouped distinct value count
#>  5  6  7  8  9
#> 27 28 29 27 27
## data.frame method
fNdistinct(airquality)
#>   Ozone Solar.R    Wind    Temp   Month     Day
#>      67     117      31      40       5      31
fNdistinct(airquality, airquality$Month)
#>   Ozone Solar.R Wind Temp Month Day
#> 5    21      27   18   18     1  31
#> 6     9      28   16   19     1  30
#> 7    24      29   17   14     1  31
#> 8    24      27   18   19     1  31
#> 9    21      27   19   20     1  30
fNdistinct(wlddev)                               # Works with data of all types!
#> country   iso3c    date    year  decade  region  income    OECD   PCGDP  LIFEEX
#>     216     216      59      59       7       7       4       2    8995   10048
#>    GINI     ODA
#>     363    7564
head(fNdistinct(wlddev, wlddev$iso3c))
#>     country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
#> ABW       1     1   59   59      7      1      1    1    32     57    0  20
#> AFG       1     1   59   59      7      1      1    1    16     57    0  58
#> AGO       1     1   59   59      7      1      1    1    38     57    2  55
#> ALB       1     1   59   59      7      1      1    1    38     56    5  30
#> AND       1     1   59   59      7      1      1    1    48      0    0   0
#> ARE       1     1   59   59      7      1      1    1    43     57    0  43
## matrix method
aqm <- qM(airquality)
fNdistinct(aqm)                                  # Also works for character or logical matrices
#>   Ozone Solar.R    Wind    Temp   Month     Day
#>      67     117      31      40       5      31
fNdistinct(aqm, airquality$Month)
#>   Ozone Solar.R Wind Temp Month Day
#> 5    21      27   18   18     1  31
#> 6     9      28   16   19     1  30
#> 7    24      29   17   14     1  31
#> 8    24      27   18   19     1  31
#> 9    21      27   19   20     1  30
## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
airquality %>% group_by(Month) %>% fNdistinct
#> # A tibble: 5 x 6
#>   Month Ozone Solar.R  Wind  Temp   Day
#>   <int> <int>   <int> <int> <int> <int>
#> 1     5    21      27    18    18    31
#> 2     6     9      28    16    19    30
#> 3     7    24      29    17    14    31
#> 4     8    24      27    18    19    31
#> 5     9    21      27    19    20    30
wlddev %>% group_by(country) %>%
  select(PCGDP, LIFEEX, GINI, ODA) %>% fNdistinct
#> Adding missing grouping variables: `country`
#> # A tibble: 216 x 5
#>    country             PCGDP LIFEEX  GINI   ODA
#>    <chr>               <int>  <int> <int> <int>
#>  1 Afghanistan            16     57     0    58
#>  2 Albania                38     56     5    30
#>  3 Algeria                58     57     3    58
#>  4 American Samoa         16      0     0     0
#>  5 Andorra                48      0     0     0
#>  6 Angola                 38     57     2    55
#>  7 Antigua and Barbuda    41     57     0    45
#>  8 Argentina              58     57    27    58
#>  9 Armenia                28     56    17    27
#> 10 Aruba                  32     57     0    20
#> # ... with 206 more rows