Fast (Grouped, Weighted) Statistical Mode for Matrix-Like Objects
fmode.Rd
fmode
is a generic function and returns the (column-wise) statistical mode i.e. the most frequent value of x
, (optionally) grouped by g
and/or weighted by w
.
The TRA
argument can further be used to transform x
using its (grouped, weighted) mode. Ties between multiple possible modes can be resolved by taking the minimum, maximum, (default) first or last occurring mode.
Usage
fmode(x, ...)
# Default S3 method
fmode(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, ties = "first", nthreads = .op[["nthreads"]], ...)
# S3 method for class 'matrix'
fmode(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ties = "first", nthreads = .op[["nthreads"]], ...)
# S3 method for class 'data.frame'
fmode(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ties = "first", nthreads = .op[["nthreads"]], ...)
# S3 method for class 'grouped_df'
fmode(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, stub = .op[["stub"]],
ties = "first", nthreads = .op[["nthreads"]], ...)
Arguments
- x
a vector, matrix, data frame or grouped data frame (class 'grouped_df').
- g
a factor,
GRP
object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to aGRP
object) used to groupx
.- w
a numeric vector of (non-negative) weights, may contain missing values.
- TRA
an integer or quoted operator indicating the transformation to perform: 0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See
TRA
.- na.rm
logical. Skip missing values in
x
. Defaults toTRUE
and implemented at very little computational cost. Ifna.rm = FALSE
,NA
is treated as any other value.- use.g.names
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's.
- ties
an integer or character string specifying the method to resolve ties between multiple possible modes i.e. multiple values with the maximum frequency or sum of weights:
Int. String Description 1 "first" take the first occurring mode. 2 "min" take the smallest of the possible modes. 3 "max" take the largest of the possible modes. 4 "last" take the last occurring mode. Note:
"min"/"max"
don't work with character data. See also Details.- nthreads
integer. The number of threads to utilize. Parallelism is across groups for grouped computations and at the column-level otherwise.
- drop
matrix and data.frame method: Logical.
TRUE
drops dimensions and returns an atomic vector ifg = NULL
andTRA = NULL
.- keep.group_vars
grouped_df method: Logical.
FALSE
removes grouping variables after computation.- keep.w
grouped_df method: Logical. Retain
sum
of weighting variable after computation (if contained ingrouped_df
).- stub
character. If
keep.w = TRUE
andstub = TRUE
(default), the summed weights column is prefixed by"sum."
. Users can specify a different prefix through this argument, or set it toFALSE
to avoid prefixing.- ...
arguments to be passed to or from other methods. If
TRA
is used, passingset = TRUE
will transform data by reference and return the result invisibly.
Details
fmode
implements a pretty fast C-level hashing algorithm inspired by the kit package to find the statistical mode.
If na.rm = FALSE
, NA
is not removed but treated as any other value (i.e. its frequency is counted). If all values are NA
, NA
is always returned.
The weighted mode is computed by summing up the weights for all distinct values and choosing the value with the largest sum. If na.rm = TRUE
, missing values will be removed from both x
and w
i.e. utilizing only x[complete.cases(x,w)]
and w[complete.cases(x,w)]
.
It is possible that multiple values have the same mode (the maximum frequency or sum of weights). Typical cases are simply when all values are either all the same or all distinct. In such cases, the default option ties = "first"
returns the first occurring value in the data reaching the maximum frequency count or sum of weights. For example in a sample x = c(1, 3, 2, 2, 4, 4, 1, 7)
, the first mode is 2 as fmode
goes through the data from left to right. ties = "last"
on the other hand gives 1. It is also possible to take the minimum or maximum mode, i.e. fmode(x, ties = "min")
returns 1, and fmode(x, ties = "max")
returns 4. It should be noted that options ties = "min"
and ties = "max"
give unintuitive results for character data (no strict alphabetic sorting, similar to using <
and >
to compare character values in R). These options are also best avoided if missing values are counted (na.rm = FALSE
) since no proper logical comparison with missing values is possible: With numeric data it depends, since in C++ any comparison with NA_real_
evaluates to FALSE
, NA_real_
is chosen as the min or max mode only if it is also the first mode, and never otherwise. For integer data, NA_integer_
is stored as the smallest integer in C++, so it will always be chosen as the min mode and never as the max mode. For character data, NA_character_
is stored as the string "NA"
in C++ and thus the behavior depends on the other character content.
fmode
preserves all the attributes of the objects it is applied to (apart from names or row-names which are adjusted as necessary in grouped operations). If a data frame is passed to fmode
and drop = TRUE
(the default), unlist
will be called on the result, which might not be sensible depending on the data at hand.
Value
The (w
weighted) statistical mode of x
, grouped by g
, or (if TRA
is used) x
transformed by its (grouped, weighed) mode.
Examples
x <- c(1, 3, 2, 2, 4, 4, 1, 7, NA, NA, NA)
fmode(x) # Default is ties = "first"
#> [1] 2
fmode(x, ties = "last")
#> [1] 1
fmode(x, ties = "min")
#> [1] 1
fmode(x, ties = "max")
#> [1] 4
fmode(x, na.rm = FALSE) # Here NA is the mode, regardless of ties option
#> [1] NA
fmode(x[-length(x)], na.rm = FALSE) # Not anymore..
#> [1] 2
## World Development Data
attach(wlddev)
## default vector method
fmode(PCGDP) # Numeric mode
#> [1] 330.3036
#> attr(,"label")
#> [1] "GDP per capita (constant 2010 US$)"
head(fmode(PCGDP, iso3c)) # Grouped numeric mode
#> ABW AFG AGO ALB AND ARE
#> 15669.6160 330.3036 3193.4039 1992.2919 41701.5444 103604.9068
head(fmode(PCGDP, iso3c, LIFEEX)) # Grouped and weighted numeric mode
#> ABW AFG AGO ALB AND ARE
#> 26630.2053 573.2876 3111.1577 5210.6883 NA 41420.4830
fmode(region) # Factor mode
#> [1] Europe & Central Asia
#> attr(,"label")
#> [1] Region
#> 7 Levels: East Asia & Pacific ... Sub-Saharan Africa
fmode(date) # Date mode (defaults to first value since panel is balanced)
#> [1] "1961-01-01"
fmode(country) # Character mode (also defaults to first value)
#> [1] "Afghanistan"
#> attr(,"label")
#> [1] "Country Name"
fmode(OECD) # Logical mode
#> [1] FALSE
#> attr(,"label")
#> [1] "Is OECD Member Country?"
# ..all the above can also be performed grouped and weighted
## matrix method
m <- qM(airquality)
fmode(m)
#> Ozone Solar.R Wind Temp Month Day
#> 23.0 259.0 11.5 81.0 5.0 1.0
fmode(m, na.rm = FALSE) # NA frequency is also counted
#> Ozone Solar.R Wind Temp Month Day
#> NA NA 11.5 81.0 5.0 1.0
fmode(m, airquality$Month) # Groupwise
#> Ozone Solar.R Wind Temp Month Day
#> 5 11 190 9.7 66 5 1
#> 6 29 250 11.5 76 6 1
#> 7 97 175 7.4 81 7 1
#> 8 44 255 11.5 86 8 1
#> 9 13 238 10.3 71 9 1
fmode(m, w = airquality$Day) # Weighted: Later days in the month are given more weight
#> Ozone Solar.R Wind Temp Month Day
#> 23.0 223.0 11.5 76.0 5.0 30.0
fmode(m>50, airquality$Month) # Groupwise logical mode
#> Ozone Solar.R Wind Temp Month Day
#> 5 FALSE TRUE FALSE TRUE FALSE FALSE
#> 6 FALSE TRUE FALSE TRUE FALSE FALSE
#> 7 TRUE TRUE FALSE TRUE FALSE FALSE
#> 8 FALSE TRUE FALSE TRUE FALSE FALSE
#> 9 FALSE TRUE FALSE TRUE FALSE FALSE
# etc..
## data.frame method
fmode(wlddev) # Calling unlist -> coerce to character vector
#> country iso3c date year
#> "Afghanistan" "2" "-3287" "1960"
#> decade region income OECD
#> "1960" "2" "1" "FALSE"
#> PCGDP LIFEEX GINI ODA
#> "330.303552908057" "62.869" "26.8" "70000.0002980232"
#> POP
#> "61786"
fmode(wlddev, drop = FALSE) # Gives one row
#> country iso3c date year decade region income
#> 1 Afghanistan AFG 1961-01-01 1960 1960 Europe & Central Asia High income
#> OECD PCGDP LIFEEX GINI ODA POP
#> 1 FALSE 330.3036 62.869 26.8 70000 61786
head(fmode(wlddev, iso3c)) # Grouped mode
#> country iso3c date year decade region
#> ABW Aruba ABW 1961-01-01 1960 1960 Latin America & Caribbean
#> AFG Afghanistan AFG 1961-01-01 1960 1960 South Asia
#> AGO Angola AGO 1961-01-01 1960 1960 Sub-Saharan Africa
#> ALB Albania ALB 1961-01-01 1960 1960 Europe & Central Asia
#> AND Andorra AND 1961-01-01 1960 1960 Europe & Central Asia
#> income OECD PCGDP LIFEEX GINI ODA POP
#> ABW High income FALSE 15669.6160 65.662 NA 36860001 54211
#> AFG Low income FALSE 330.3036 32.446 NA 116769997 8996973
#> AGO Lower middle income FALSE 3193.4039 45.201 52 -390000 5454933
#> ALB Upper middle income FALSE 1992.2919 71.860 27 9310000 1608800
#> AND High income FALSE 41701.5444 NA NA NA 13411
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
head(fmode(wlddev, iso3c, LIFEEX)) # Grouped and weighted mode
#> country iso3c date year decade region
#> ABW Aruba ABW 2020-01-01 2019 2010 Latin America & Caribbean
#> AFG Afghanistan AFG 2020-01-01 2019 2010 South Asia
#> AGO Angola AGO 2020-01-01 2019 2010 Sub-Saharan Africa
#> ALB Albania ALB 2020-01-01 2019 2010 Europe & Central Asia
#> AND Andorra AND 2021-01-01 2020 2020 Europe & Central Asia
#> income OECD PCGDP LIFEEX GINI ODA POP
#> ABW High income FALSE 26630.2053 76.293 NA -12840000 106314
#> AFG Low income FALSE 573.2876 64.833 NA 4339979980 38041754
#> AGO Lower middle income FALSE 3111.1577 45.201 51.3 49230000 31825295
#> ALB Upper middle income FALSE 5210.6883 71.860 33.2 31410000 2854191
#> AND High income FALSE NA NA NA NA NA
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
detach(wlddev)